FACTORS INFLUENCING TEST RELIABILITY 
PERCIVAL M. SYMONDS 


Teachers College, Columbia University 


This paper proposes to list and discuss the factors which influence 
the reliability of tests. Were psychologists more conscious of what it 
is that makes a test reliable, fewer blunders would be made in devising 
tests which have low and unsatisfactory reliability. The development 
of the natural sciences depended on the development of exact
measurements, and the development of psychology as a science likewise
depends on the perfection of its measuring instruments. Much of the
recent work in the development of tests, particularly in the
measurement of personality, is practically worthless because the tests
do not tell a consistent story.

[Diagram: two upper circles, Test 1 and Test 2, and a lower circle, General Field]

Reliability in this paper is defined as the correlation between two 
comparable tests. If a test is split so that one half contains items 
1, 3, 5, 7, etc., and the other half items 2, 4, 6, 8, etc., these two halves 
constitute in themselves comparable tests. Any two comparable 
tests may be thought of as being the split halves of a test double the 
length of either. If the test is an objective new-type test containing 
homogeneous material, corresponding items of comparable tests 
resemble each other in form, significance, difficulty, etc. If the test is 
the traditional essay or problem type of examination, corresponding 
items of comparable tests may have little resemblance to one another 
in form, significance or difficulty. 
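
The odd/even split described above can be sketched in a few lines of Python. The function names and the four-pupil response matrix are hypothetical illustrations, not data from this study; each row holds one pupil's right (1) or wrong (0) answers in test order.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_r(responses):
    """Correlate the odd-item half score with the even-item half score.

    Python counts from 0, so row[0::2] holds items 1, 3, 5, ... and
    row[1::2] holds items 2, 4, 6, ...
    """
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    return pearson_r(odd, even)


# Hypothetical answers of four pupils to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
print(round(split_half_r(responses), 3))  # 0.949
```

The coefficient obtained this way is the reliability of a half-length test; the Spearman-Brown formula discussed under factor 1 is the usual route from it to the reliability of the whole.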

To use Kelley’s¹ terminology and the above diagram, let the two
upper circles represent the field measured by two comparable tests 1
and 2 and let the lower circle represent the field that the tests are
intended to measure, such as intelligence, reading, algebra, or French.

Let $a$   = a factor common to test 1 and test 2, and to the field that
            the test intends to measure.
    $b_1$ = a factor common to test 1 and the general field but not
            to test 2.
    $b_2$ = a factor common to test 2 and the general field but not
            to test 1.
    $c$   = a factor common to test 1 and test 2, but not to the
            general field.
    $d_1$ = a factor unique to test 1, but not chance.
    $d_2$ = a factor unique to test 2, but not chance.
    $e_1$ = a chance factor found in test 1.
    $e_2$ = a chance factor found in test 2.


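Kelley defines the reliability of test 1 as

$$r_{11} = \frac{\sigma_a^2 + \sigma_c^2}{\sigma_1^2}$$

The validity of test 1 is

$$\frac{\sigma_a^2 + \sigma_{b_1}^2}{\sigma_1^2}$$

where $\sigma_1^2$ is the total variance of test 1 and $\sigma_a^2$,
$\sigma_{b_1}^2$, $\sigma_c^2$ are the variances contributed by factors
$a$, $b_1$ and $c$.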


The problem of this paper is to isolate factors a and c and distinguish 
them from factors d and e. 
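
Kelley's ratio can be checked with a small simulation, under assumptions the paper does not make: each factor is an independent normal variable, the factors b, d and e are drawn separately for each test, and both tests are given the same (hypothetical) factor spreads so that σ₁ = σ₂. Under those assumptions the correlation between the two comparable tests should come out close to (σa² + σc²)/σ₁².

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical standard deviations of each factor's contribution.
SD = {"a": 1.0, "b": 0.5, "c": 0.5, "d": 0.5, "e": 0.7}


def simulate(n_pupils=20_000, seed=1):
    """Correlation between two comparable tests built from Kelley's factors."""
    rng = random.Random(seed)
    test1, test2 = [], []
    for _ in range(n_pupils):
        a = rng.gauss(0, SD["a"])  # shared by both tests and by the field
        c = rng.gauss(0, SD["c"])  # shared by both tests, not by the field
        # b, d and e are drawn afresh for each test, so the tests do not share them.
        for scores in (test1, test2):
            b = rng.gauss(0, SD["b"])
            d = rng.gauss(0, SD["d"])
            e = rng.gauss(0, SD["e"])
            scores.append(a + b + c + d + e)
    return pearson_r(test1, test2)


# Kelley's ratio for test 1: shared variance over total variance.
var_test1 = sum(s ** 2 for s in SD.values())
kelley_r11 = (SD["a"] ** 2 + SD["c"] ** 2) / var_test1
observed_r12 = simulate()
print(round(kelley_r11, 3), round(observed_r12, 3))
```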

It is customary to group the factors influencing test reliability into:
(1) factors in the construction of the tests themselves and (2) factors
in the variability of the individuals taking the tests. For certain
factors this is a clear-cut distinction; for others both irregularity in
test construction and the variability in individuals seem to be
operative. Factors concerned with the construction of tests may be
divided into: (a) general factors, such as the influence of directions,
objectivity of scoring, and character of printing, and (b) the character
of specific items, such as the affective tinge of items or catch
questions. Likewise the variability of individuals may be divided into:
(c) the general conditions of the individual, such as excitement or
nervousness, and (d) specific methods of attack on the test, such as
speed or accuracy.

¹ Kelley, T. L.: Note on the Reliability of a Test: A Reply to Dr. Crum’s
Criticism. Journal of Educational Psychology, Vol. XV, April, 1924, pp. 193-204.

1. The most important single factor influencing test reliability is 
the number of test items. That is, the greater the number of the items in 
a test, the more reliable the test. The evidence for this is both deductive 
and experimental. It can be argued from the nature of the correlation 
of sums or averages that an increase in the number of items in a test 
(provided the test retains identity in comparability with the original 
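test) increases the reliability. This increase in reliability has a
mathematical relationship as given by the Spearman-Brown formula:¹

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

where $r_{11}$ is the reliability of the original test and $r_{nn}$ the
reliability of a comparable test $n$ times as long.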


The reliability of tests would increase exactly as predicted by the 
Spearman-Brown formula if the longer tests were exactly comparable 
to the shorter tests. Since this is never actually true there is deviation 
from the Spearman-Brown prophecy in actual practice. The extent 
to which actual data fit the Spearman-Brown formula has been the 
subject of much experimentation and discussion.²
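
The Spearman-Brown prophecy is a one-line computation. The sketch below assumes, as the formula itself does, that every added part is strictly comparable to the original test; the starting reliability of .50 is a hypothetical value.

```python
def spearman_brown(r11, n):
    """Predicted reliability of a test n times as long as a test whose
    reliability is r11, assuming the added parts are strictly comparable."""
    return n * r11 / (1 + (n - 1) * r11)


# Repeatedly doubling a test whose reliability is .50
for n in (1, 2, 4, 8, 16):
    print(n, round(spearman_brown(0.5, n), 3))
# 1 0.5
# 2 0.667
# 4 0.8
# 8 0.889
# 16 0.941
```

The predicted values climb toward 1 without reaching it. In practice they hold only so far as the lengthened test remains comparable with the original, which is the point at issue in the studies cited.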








¹ Kelley, T. L.: “Statistical Method.” P. 205.

Spearman, C.: Correlation Calculation with Faulty Data. British Journal
of Psychology, Vol. III, Oct., 1910, pp. 271-295.

Brown, W.: Some Experimental Results in the Correlation of Mental Ability.
British Journal of Psychology, Vol. III, Oct., 1910, pp. 296-322.

² Chu, J. P.: “Chinese Students in America: Qualities Associated with Their
Success.” Teachers College Contributions to Education No. 127.

Crum, W. L.: Note on the Reliability of a Test, with Special Reference to the 
Examinations Set by the College Entrance Board. The Mathematical Monthly, 
Vol. XXX, Sept., Oct., 1923. 

Holzinger, K. J.: Note on the Use of Spearman’s Prophecy Formula for 
Reliability. Journal of Educational Psychology, Vol. XIV, May, 1923, pp. 302-305. 

Holzinger, K. J. and Clayton, B.: Further Experiments in the Application 
of Spearman’s Prophecy Formula. Journal of Educational Psychology, Vol. XVI, 
May, 1925, pp. 289-290. 

Kelley, T. L.: The Applicability of the Spearman-Brown Formula for the 
Measurement of Reliability. Journal of Educational Psychology, Vol. XVI, 
1925, pp. 300-303. 
Holzinger, who was earlier skeptical of the law that the length of a test
is a factor in reliability which can be expressed mathematically by the
Spearman-Brown formula (“This would lead one to expect that by
continual lengthening we could approach perfect reliability as closely
as we please. Experience with tests and children, however, shows
that this is absurd.”), was later led to say, as the result of his
experimental researches, “The Spearman-Brown Law may fail to give a
satisfactory prediction with unhomogeneous, unevenly graduated
test material when the formula is based on the first observed coefficient
$r_{12}$ or the average of several. With accurately calibrated test material
the above law has been shown to give an excellent basis for prediction.”

Any factor which apparently tends to make a test have a greater 
or smaller number of items or which is correlated with number of 
items is a factor in test reliability. Among these factors may be 
mentioned: 

2. Other things being equal, the longer the time a test occupies, the
greater its reliability. This is true as a general law merely because the
length of time of a test is positively correlated with the number of items.

3. Other things being equal the narrower the range of difficulty of the 
items of a test the greater the reliability. If an item is so hard that no 
one in the group answers it, that item may be omitted without chang- 
ing the score of any individual taking the test. Consequently it has 
no influence upon the reliability of the test and really makes the test 
equivalent to a test having one item less. Likewise a test including 
items so easy that everyone in a group answers them correlates per- 
fectly with the same test minus those easy items. Hence those items 
add nothing to the value of the test. That item which has the greatest 
influence on the reliability of a test is one answered by 50 per cent of 
the group taking the test. 
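
The argument can be put in numbers with the usual variance p(1 − p) of an item scored right or wrong, where p is the proportion of the group passing it. This formula is not in the paper; it is the standard one for a two-valued score. It is zero for items that everyone or no one passes and largest at p = .50.

```python
def item_variance(p):
    """Variance of a right/wrong item passed by a proportion p of the group."""
    return p * (1 - p)


for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"passed by {p:.0%}: variance {item_variance(p):.2f}")
```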





Kelley, T. L.: Note on the Reliability of a Test: A Reply to Dr. Crum’s
Criticism. Journal of Educational Psychology, Vol. XV, pp. 194-204.

Slocombe, C. S.: The Spearman Prophecy Formula. Journal of Educational
Psychology, Vol. XVIII, Feb., 1927, pp. 125, 126.

Slocombe, C. S.: A Further Note on the Use of the Spearman Prophecy 
Formula: A Correction. Journal of Educational Psychology, Vol. XVIII, May, 
1927, pp. 347, 348. 

Ruch, G. M., Ackerson, L., and Jackson, J. D.: An Empirical Study of the 
Spearman-Brown Formula as Applied to Educational Test Material. Journal of 
Educational Psychology, Vol. XVII, May, 1926, pp. 309-313. 

Wood, B. D.: Studies of Achievement Tests, Part III. Journal of Edu- 
cational Psychology, Vol. XVII, April, 1926, pp. 263-269. 
This factor affecting reliability was studied experimentally by the
writer. Following is a brief description of the experimental work.
Three sets of spelling lists were given to fifth grade children. List 1 
consisted of 100 words, 4 words being taken from each of the columns 
H to AF on the Ayres Spelling Scale. List 2 consisted of 100 words, 
2 words from column L, 8 each from columns M to X and 2 from col- 
umn Y. List 3 consisted of 100 words, 50 words each from columns 
Q and R. Each list was broken up in the statistical treatment into 
two comparable lists of 50 words each. 


Let the following numbers represent the code for identifying the
tests, each of 50 words:

    List 1    I, II
    List 2    III, IV
    List 3    V, VI

The intercorrelations are:

               I      II    |   III     IV    |    V      VI
    I        ......  .877   |  .827   .848    |  .833   .775
    II       .877   ......  |  .857   .886    |  .869   .783
    III      .827   .857    | ......  .899    |  .874   .810
    IV       .848   .886    |  .899  ......   |  .904   .842
    V        .833   .869    |  .874   .904    | ......  .862
    VI       .775   .783    |  .810   .842    |  .862  ......
    Mean r   .832   .854    |  .853   .876    |  .868   .814
    Average      .843       |      .864       |      .841
    M       20.48  19.54    | 24.39  21.70    | 26.74  32.42
    σ        5.86   5.88    |  9.10   9.22    | 11.12  11.82
These figures indicate that the reliability coefficients of all these 
groups of tests are practically identical—a test of 50 items with a 
wide range of difficulty has the same reliability coefficient as a test of 
50 items with a narrow range of difficulty. But the error of placement 
of the test with the narrow range of difficulty is much less. For
example, the standard error of measurement, $\sigma_1\sqrt{1 - r_{11}}$,
for each group of tests 1, 2 and 3 is as follows:

    Test Group    σ√(1 − r₁₁)
        1            2.33
        2            3.38
        3            4.58
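
The paper does not say which σ and which reliability entered these figures. A minimal sketch, assuming each group's σ is the mean of its two half-test σ's and r is the Average row of the intercorrelation table, reproduces them closely:

```python
from math import sqrt

# (sigma of first half-test, sigma of second half-test, average reliability),
# copied from the intercorrelation table.
TABLE = {
    1: (5.86, 5.88, 0.843),
    2: (9.10, 9.22, 0.864),
    3: (11.12, 11.82, 0.841),
}


def standard_error_of_measurement(sigma, r):
    """sigma * sqrt(1 - r): the spread of obtained scores about true scores."""
    return sigma * sqrt(1 - r)


sem = {}
for group, (sigma_a, sigma_b, r) in TABLE.items():
    sigma = (sigma_a + sigma_b) / 2  # assumption: the two halves are averaged
    sem[group] = standard_error_of_measurement(sigma, r)
    print(group, round(sem[group], 2))
```

This gives 2.33 and 3.38 for groups 1 and 2, and 4.57 for group 3 against the printed 4.58; the small difference is probably rounding in the original computation.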
But in the words of group 1, one word, which corresponds to one
point of score, is the equivalent of 1/4 of a step on the Ayres Scale. In
group 2, one word or one point of score is the equivalent of 1/8 of a step
on the Ayres Scale. In group 3, one word or one point of score is the
equivalent of 1/50 of a step on the Ayres Scale. Translating the standard
error of measurement into steps on the Ayres Scale, we have:

    Test Group    σ√(1 − r₁₁), in steps on the Ayres Scale
        1            .58
        2            .42
        3            .09
That is, the test with the narrowest range of difficulty shows an
error of measurement considerably less than a test with a wider range
of difficulty. The illustration used with the Ayres Scale blurred the
actual phenomenon. As matters stood the words in test group 3 
selected from columns Q and R of the Ayres Scale did not actually 
represent as narrow a range of difficulty for this particular group as 
the scale itself indicates. To make this phenomenon more decisive 
the difficulty values of the words in the spelling tests should have been 
determined on the group which took the tests. If that had been done,
and if individuals showed a uniformity of experience, the list of words
of uniform difficulty would have failed to measure the group at all.
Most of the pupils would have made zero or perfect scores. The 
situation is the same as trying to use a clinical thermometer which 
registers only a narrow range to measure all temperatures. 
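
The conversion into Ayres steps used above is a single division: a standard error expressed in words, divided by the number of words that make up one step, gives the standard error in steps. A short sketch with the figures from the text:

```python
# Standard errors of measurement in score points (words), as printed.
SEM_WORDS = {1: 2.33, 2: 3.38, 3: 4.58}
# One word equals 1/4, 1/8 and 1/50 of an Ayres step in groups 1, 2 and 3.
WORDS_PER_STEP = {1: 4, 2: 8, 3: 50}

sem_steps = {g: SEM_WORDS[g] / WORDS_PER_STEP[g] for g in SEM_WORDS}
for g, s in sem_steps.items():
    print(g, round(s, 2))
# 1 0.58
# 2 0.42
# 3 0.09
```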

4. Evenness in scaling is a factor influencing the reliability of a test. 
Other things being equal a test evenly scaled is more reliable than a 
test that has gaps in the scale of difficulty of its items. Bunching 
items together in difficulty has the same effect on the reliability of a 
test as lowering the number of items. For instance, if an extreme case 
is taken such that items are divided into two groups, the items in one 
group being passed by the majority of pupils in a class and the items 
in the other group being failed by the majority of pupils in a class, the 
test is reduced to little better than a test of two items. Bunching of 
items in difficulty with consequent gaps in the scale tends to lower test 
reliability in much the same way that a hundred-foot tape with
divisions marked only in feet, or a spring balance with divisions marked
only in pounds yields unreliable measurement. 
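
The effect of bunching can be illustrated with a deliberately simple model that is an assumption of this sketch, not the paper's: a pupil passes an item exactly when his ability exceeds the item's difficulty, and the hypothetical abilities are spread evenly between 0 and 1. Twenty evenly scaled items then sort pupils into 21 score levels, twenty items bunched at two difficulties sort them into only three, and the bunched test gives exactly ten times the score of a two-item test.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scores(abilities, difficulties):
    """Number of items each pupil passes under the all-or-none model."""
    return [sum(1 for d in difficulties if a > d) for a in abilities]


abilities = [i / 100 for i in range(101)]  # 101 hypothetical pupils
even = [i / 21 for i in range(1, 21)]      # 20 items, evenly scaled
bunched = [0.3] * 10 + [0.7] * 10          # 20 items in two bunches
two_items = [0.3, 0.7]                     # a test of two items

even_scores = scores(abilities, even)
bunched_scores = scores(abilities, bunched)

print(len(set(even_scores)), len(set(bunched_scores)))  # 21 3
print(bunched_scores == [10 * s for s in scores(abilities, two_items)])  # True
# How closely each set of scores follows the underlying ability:
print(round(pearson_r(abilities, even_scores), 3),
      round(pearson_r(abilities, bunched_scores), 3))
```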

Usually this bunching of items in difficulty follows no law of
regularity and hence cannot be given a mathematical formulation. If the
grouping takes place systematically, Sheppard’s correction for the
standard deviation may be applied in estimating the loss of reliability.
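
Sheppard's correction removes h²/12 from the variance of scores grouped into intervals of width h. The paper does not spell out how the correction is turned into a loss of reliability; the sketch below, an assumption rather than the author's procedure, reports the corrected standard deviation and the share of the grouped variance due to the grouping term. The figures are hypothetical.

```python
from math import sqrt


def sheppard_corrected_sd(sd_grouped, h):
    """Standard deviation after Sheppard's correction for grouping
    in intervals of width h: sqrt(sd**2 - h**2 / 12)."""
    return sqrt(sd_grouped ** 2 - h ** 2 / 12)


def grouping_share(sd_grouped, h):
    """Fraction of the grouped variance accounted for by the grouping term."""
    return (h ** 2 / 12) / sd_grouped ** 2


# Hypothetical: scores with a standard deviation of 5 points, grouped by 3s.
print(round(sheppard_corrected_sd(5.0, 3.0), 3))  # 4.924
print(round(grouping_share(5.0, 3.0), 3))         # 0.03
```
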

5. Other things being equal interdependent items tend to decrease the
reliability of a test. If the answer to one item is suggested in another 
item, or if the meaning of one item depends upon a previous item, these 
items act to lower the reliability, for the tendency is to answer both
items or neither, which produces an effect equivalent to reducing the
number of items in a test. Asking several questions
on one paragraph in a reading test comes under this head, for if a pupil 
fails to understand the paragraph he has difficulty with all the items 
on that paragraph. 
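
A simulation shows the effect under assumptions that belong to this sketch rather than to the paper. Each hypothetical pupil has a normally distributed ability, and an item is answered correctly when ability plus a fresh normal disturbance exceeds zero. In the interlocked form one disturbance decides a pair of items together, so both are right or both are wrong. When two comparable forms of each kind are correlated, twenty interlocked items behave like ten independent ones.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def form_scores(abilities, n_units, items_per_unit, rng):
    """Score one form built from n_units independent units; every item in a
    unit shares a single right/wrong outcome."""
    result = []
    for theta in abilities:
        total = 0
        for _ in range(n_units):
            right = theta + rng.gauss(0, 1) > 0
            total += items_per_unit * right
        result.append(total)
    return result


rng = random.Random(7)
abilities = [rng.gauss(0, 1) for _ in range(2000)]

# Two comparable forms of 20 independent items each.
r_independent = pearson_r(form_scores(abilities, 20, 1, rng),
                          form_scores(abilities, 20, 1, rng))
# Two comparable forms of 20 items arranged as 10 interlocked pairs.
r_paired = pearson_r(form_scores(abilities, 10, 2, rng),
                     form_scores(abilities, 10, 2, rng))
print(round(r_independent, 2), round(r_paired, 2))
```

Scoring each interlocked pair is the same as scoring a ten-item test and doubling it, and doubling a score does not change a correlation.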

6. The more objective the scoring of a test the more reliable is the test. 
One factor which may influence the variability of test scores is the 
uniqueness of the answers which are given credit. If a test is perfectly 
objective, i.e., if the answers which are given credit are sharply
defined in a key and only those answers are given credit, this factor 
influencing reliability is eliminated. But where judgment of the 
scorer enters in determining the acceptability or fitness of an answer, 
as in the verbal completion test, there is a factor causing test 
unreliability. 

7. As a corollary to the last point, scoring inaccuracy is a factor in
test reliability. This factor is eliminated with accurate scoring. But 
errors in scoring give rise to a variation in score which lowers the 
reliability of the test. 

8. Chance in getting the correct answer to an item is a factor in test 
reliability. Some of the most objective forms of tests offer the most 
opportunity for chance to influence the score. The true-false test is a 
type in which chance plays a maximum part in determining the score. 
In the single answer test and in subjective tests chance plays a negli- 
gible part in test reliability because the ratio of the one correct answer 
to the multitude of possible answers is so small. In the case of
multiple-response tests the influence of chance in determining the
correctness of any item is

$$\frac{1}{n}$$

where $n$ is the number of alternatives provided. A skilful test maker
can lower this ratio by including misleading associations among the
various alternatives. There has been much speculation as to the
influence of chance in lowering test reliability. It is especially
important to know the relative influence of chance in lowering
reliability as against objectivity in raising it. This has been
answered by Ruch,¹ whose results are given in the following table of
reliabilities (Form A–Form B) for comparable tests of 100 items.


                                      COEFFICIENT OF
TYPE OF TEST                          RELIABILITY
[label illegible in source]               .950
7-response multiple choice                .907
5-response multiple choice                .882
3-response multiple choice                .890
2-response multiple choice                .843
[label illegible in source]               .837


This table enables one to estimate the loss in reliability which is 
due to chance. 
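
One way to put the table to that use is to solve the Spearman-Brown formula for n, the number of times a less reliable form must be lengthened to match a more reliable one. This assumes the formula holds across these forms, which the earlier discussion shows is only approximately true, and it is not Ruch's own analysis.

```python
def length_factor(r_short, r_target):
    """Solve r_target = n*r / (1 + (n - 1)*r) for n, with r = r_short."""
    return r_target * (1 - r_short) / (r_short * (1 - r_target))


# Coefficients from Ruch's table, all for tests of 100 items.
FIRST_ROW = 0.950
OTHER_ROWS = {
    "7-response multiple choice": 0.907,
    "5-response multiple choice": 0.882,
    "3-response multiple choice": 0.890,
    "2-response multiple choice": 0.843,
    "last row": 0.837,
}
for label, r in OTHER_ROWS.items():
    n = length_factor(r, FIRST_ROW)
    print(f"{label}: about {n:.1f} times as long to reach {FIRST_ROW}")
```

For the last row (.837) the factor is about 3.7, so on these assumptions a 100-item test of that form would need roughly 370 items to match the first row.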

Any factor that causes chance to play a part in determining whether 
or not an item is to be answered is also a factor that influences test 
reliability. 
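
The chance element can be given a rough size. If each of k items is answered by pure guessing among n alternatives, and the guesses are independent (an assumption of this sketch), the number right follows a binomial distribution with mean k/n and standard deviation √(k(1/n)(1 − 1/n)). That spread is variation due to chance alone.

```python
from math import sqrt


def chance_score(k, n):
    """Mean and standard deviation of the number right when k items with
    n alternatives each are all answered by independent blind guesses."""
    p = 1 / n
    return k * p, sqrt(k * p * (1 - p))


print(chance_score(100, 2))  # (50.0, 5.0): 100 two-alternative items
for n in (3, 5, 7):
    mean, sd = chance_score(100, n)
    print(f"{n}-response: mean {mean:.1f}, sd {sd:.2f}")
```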

9. In a recognition test the position of the correct item among the 
alternatives influences the reliability. Mathews,² in demonstrating this
fact, found that there is a tendency to mark the left rather than the
right of two indicated responses or the upper rather than the lower of
a pair of response words. “This influence is greatest where guessing is
greatest. In such a test as the Yes and No the child who knows least,
probably tends to guess most, at least if definite instructions against
guessing are not given. Therefore, in two-response tests with equal
numbers of alternate responses the influence of position of response
and the error this would make in a score based upon rights minus
wrongs, favors the good student who is influenced only a little by the
position of the printed responses.”

10. Other things being equal, the more homogeneous the material
of a test, the greater its reliability. The reason for this may be seen by
referring to the diagram on page 73. If the items of a test are
heterogeneous in subject-matter, or if the factors b, d and e are more
numerous than factors a and c, by definition the reliability becomes less.
If a test maker purposely includes items of diverse character in a test in
order to sample different phases of the function being measured, he
does so at the expense of reliability, since the increased heterogeneity of
the material only works to lower the reliability.

¹ Ruch, G. M.: “Objective Examination Methods in the Social Studies.”
P. 78.

² Mathews, C. O.: The Effect of Position of Printed Response Words upon
Children’s Answers to Questions in Two-response Types of Tests. The Journal
of Educational Psychology, Vol. XVIII, Oct., 1927, pp. 445-457.

11. Among chance factors may be mentioned the commonness or
uniqueness of the experiences in the test. Other things being equal, the
more common the experiences called for in a test are to the members of
the group taking the test, the more reliable the test. For this reason tests
of things learned in school are more reliable than tests of things learned
outside of school. In general, tests of conduct or character will always be
less reliable than tests of ability in which there is universal agreement
as to rightness or correctness. Tests using material taken from the
common environment are the more reliable. A good example of this
is found in the Stanford Achievement Test. The test in language
usage, even though it contains more items than test 1, reading of
paragraphs; test 2, arithmetic computation; and test 3, arithmetic
reasoning, has a lower reliability coefficient than these other tests.
Part of this may be due to the form, for since each item of the language 
usage test is a two-alternative item, chance may play a considerable 
part in the score. But part of the lower reliability may be due to 
the fact that pupils learn their language habits mainly at home where 
each home has its own standard of correctness. This results in a 
subjectivity in the taking of the test, and a subjectivity of the responses 
considered correct by the testee rather than by the scorer. On the
other hand, tests on subjects learned at school have a common standard 
of correctness. Words have a single meaning, and arithmetic prob- 
lems have but one answer. 

12. Variations of this factor of commonness or uniqueness of 
material occur frequently. Other things being equal, the same test given 
late in the school year is more reliable than when given early in the year. 
Given early in the year much of the material of the test will not have 
been formally covered in class. Whether or not a pupil answers many 
items in the Powers General Chemistry Test at the opening of the 
school year will depend on such factors as whether he has read chemis- 
try outside of class, or whether he has had a chemistry experimental 
set at home, or whether his family discusses matters of intellectual
interest at the table. These are chance factors and influence test
scores in a chance way.

13. Another factor similar to the last is the inclusion of extraneous 
or dead material in a test. If a test contains material not discussed in 
class or not given in the textbook, that test is less reliable than a test 
having the same number of items all of which are relevant. Such a 
test may be considered as the equivalent of a shorter test with a 
smaller number of items, with chance determining the answers to 
the extraneous or dead material. For this reason standardized tests 
in a subject are probably less reliable than comparable tests, equivalent 
in form and length, but containing only material relevant to the 
course as given. 

14. Other things being equal, catch questions in a test lower the reli- 
ability of the test. A test answered by the systematic recall or recog- 
nition of orderly facts or experience is more reliable than a test answered 
by sudden insight because of novelty. Questions which must be 
answered by sudden insight tend to lower the reliability of the test. 
Thorndike¹ has noted the incidence of this factor. He says: “The
equalization of environmental influence obtained by novelty in and of
itself has one notable practical disadvantage. Special coaching
for the test is likely to produce many great inequalities in favor of
those who receive it.” The most reliable tests are those in which
special coaching has the least influence. 

15. Subtle factors in a test item which tend to be misinterpreted 
or over- or underemphasized help make the test in which the item is
included unreliable. Such factors are: 

(a) The Emotional Tinge of Words in Items.—If words are included 
in an item which cause the item to be misinterpreted or which lend 
false clues or associations, that item is a factor of unreliability in the 
test in which it is included. 

(b) Length of a Test Item.—The longer a test item the more chance 
there is that it will be misinterpreted or that certain factors in the 
item will be over- or underestimated. Items which require extensive
reading tend to be less reliable than items which require little reading. 

(c) Choice of Words.—If strange or unusual words are used, or if 
words are used with unusual or technical meanings they tend to 
increase the unreliability of an item. Any item which contains trade 
secrets tends to be less reliable than an item in which all terms are 
used with their ordinary connotation. 

(d) Poor sentence structure, particularly an unusual order of words 
tends to lead to misinterpretation of an item and is a factor in 
unreliability. 

(e) Inadequate or faulty directions in a test or the failure to provide 
suitable illustrations of the task tend to lead to test unreliability. 





¹ Thorndike, E. L.: “The Measurement of Intelligence.” P. 438.
(f) Any factor which makes one misinterpret the intention in a test 
item tends to make that item and hence the whole test unreliable. 
Matters of printing, spacing, paragraphing, etc., are all potent in this 
connection. If a term or phrase is split so that it occupies two lines or
if variations in type are used so that certain parts of the test are made 
to stand out or others are diminished the test is liable to be misinter- 
preted and become unreliable. 

The factors to be discussed next are those which have to do with 
the variability of the individuals being tested. 

16. It has been shown that the speed of taking a test is a factor in 
the reliability of the test. Individuals may vary in the speed with 
which they take a test. At one time they may work more slowly 
than at another time. Part of this is due to the matter of getting 
adjusted to the taking of the test. One has to learn how to work the 
test as well as the requirements of the exercises themselves. Pupils 
differ in the speed with which they adjust themselves to the taking of 
a test. This is due partly to general mental agility and partly to 
experience in taking tests. Practice and experience with tests,
particularly with the mechanics of taking the test, help to diminish the
unreliability of tests. A fore-exercise to a test which pupils may
experiment with before taking the test itself helps to stabilize this 
factor of speed. In this connection the accuracy with which a test is 
timed is an important factor in test reliability. 

17. Accuracy in taking a test is an important factor influencing 
reliability. A pupil will vary at different times in his accuracy on a 
test. This may be due to the set which he is given by the directions 
in the test. It is often due to the fact that before a pupil understands 
what the test requires he will proceed with less accuracy than later 
when he understands the nature of the test exercises. Part of this 
factor is due to the way in which we teach pupils to interpret test 
results. Our insistence on speed and the length of the test leads a 
pupil to believe that he is expected to cover as much ground as
possible, regardless of the accuracy of his work. That this is so has been
demonstrated many times from results of standardized tests.² The
fact that the accuracy with which items on a test are answered is a
factor in reliability was evidenced by the study of Symonds¹ referred
to above.

¹ Symonds, P. M.: A Study of Extreme Cases of Unreliability. Journal of
Educational Psychology, Vol. XV, Feb., 1924, pp. 99-106.

² See Thorndike, E. L.: “Psychology of Algebra.” Chap. 12.
18. Incentive or Effort.—Differences in incentive and effort tend to
make tests unreliable. The appeal of a test is stronger with some
pupils than with others, and is stronger with a pupil at one time than
at another. It is commonly assumed that tests have a uniform and a
maximum appeal, but this must be far from the case. When one
comes to character or personality tests this factor is greatly magnified. 
Such tests assume that the pupil is being impelled by the same forces 
of interest and purpose. In the case of achievement tests a strong 
motive is thrown into the field, like a magnet, and as in magnetization 
we assume that all of the molecules align themselves in the direction of 
polarity, so we assume that the motive of the test is equally effective 
on all pupils taking the test. But in a character or personality test we 
cannot even assume a uniform motivation. Probably this is as potent 
as any other factor in causing the unreliability of tests of character. 

19. An unknown, but probably powerful, factor determining 
unreliability is the obtrusion of competing ideas. This perseveration 
of previous experience is a factor that must be reckoned with. Chil- 
dren bring to school perseverating experiences from the movies, family 
life, happenings on the street, playground and locker room. If a 
pupil is steeped in the sentimentality of a movie, or worried because of 
friction at home, or afraid because of a bully who promised to get him 
after school, his mental mechanism is surely less able to stick at the 
manipulations required on a test than if his mind is freed from such 
extraneous factors. Pupils probably differ in their ability to concen- 
trate. Pupils differ, also, in the number or intensity of outside dis- 
tracting influences which they encounter. And any one pupil will be 
more dominated by the compelling idea on some occasions than on
others.

20. Following closely on this last point is the matter of distractions 
during the test itself. Any incident that occurs in the schoolroom 
during the taking of a test influences to some degree the taking of the 
test. If a test is given while noisy pupils are having recess on the 
playground under the window or in the school corridors or if a test is 
given at the end of the hour when the pupil is momentarily expecting 
the bell for dismissal, the conditions are unfavorable for the best 
results. Distracting incidents in the schoolroom also ought to be 
avoided. Pupils should not be allowed to leave their seats during a 
test. No questions from pupils should be permitted after the test is 
started. Directions should be given concerning what to do if the test 
is finished ahead of time. Under no conditions should pupils be 
allowed to leave the examination room early if confusion is to be the 
result. 

21. Accidents occurring during the examination, such as breaking a 
pencil, running out of ink, or defective test booklets, influence the 
reliability of the test. So far as possible, accidents should be foreseen,
prevented if possible, and in any event quickly remedied. Extra pencils
should be at hand and the examiner should be on watch to supply one 
in case of need. Fresh examination papers should be noiselessly dis- 
tributed wherever it is evident that the first copy is imperfect. 

22. Illness, worry, excitement, probably are minor factors in test 
reliability. The pupil who has sprained a wrist so that he cannot 
write with his writing hand is at an obvious disadvantage. Likewise 
if the pupil is working with a splitting headache or high fever the 
average results should not be expected. The human machine can 
submit to marked variations in physical efficiency, however, with no
marked change in mental efficiency. In general this factor of general
condition of the individual has been much overemphasized in
considerations of test reliability. Most persons believe that excitement,
worry, and variations in physical efficiency markedly influence test 
results. Many teachers would entirely discard test results as measures 
of achievement because they believe that pupils are unable to do them- 
selves justice on an examination. This superstition probably may be 
traced to one’s own experience in taking tests and the rationalization 
that would excuse a test result on the basis of excitement, worry or 
nervousness. 

On the other hand experimentation shows that the general condi- 
tion of an individual is of relative unimportance in influencing the 
results of a test. The study by Symonds already referred to found 
that extreme cases of unreliability were practically never due to the 
general conditions of the individual. A number of experiments have 
been conducted studying various phases of work and efficiency. The 
general conclusion is that mental work has a remarkable consistency 
even during or following a variety of distracting influences. Con- 
tinuous mental work or fatigue has been found to have little effect on 
the subsequent efficiency. Loss of sleep, fasting, atmospheric effects 
all seem to produce no immediate effect on the capacity to do mental 
work. Concerning this Gates says, “Such facts bear witness to the
remarkable stability of the mechanisms involved in well habituated 
mental activities. It is surprising that those functions, which may 
be so readily allowed to operate below maximum in the absence of 
incentives, remain unimpaired in efficiency, during and after such
extreme deprivations and exertions. The facts attest, also, to the
remarkably effective and facile adaptability of the human organism
to unfavorable conditions imposed upon it.” In general, therefore,
distractions, or general conditions of the individual, are unimportant
as factors in test reliability.

23. Indeed, these individual variations have so little potency in
causing unreliability that Woodyard¹ has found little lowering of
reliability for intervals of up to a year. In other words, an interval
of up to a year between repetitions of comparable tests has little
influence on the correlation between them.

24. Cheating may be a factor in the reliability of a test. Hines has
described a wide variety of methods of cheating in school examina-
tions.² Cheating tends to make an individual score higher (or lower)
than he otherwise would, and hence tends to lower the correlation
between a test and the true scores of individuals on that test.

25. Another factor that affects the reliability of a test score is the
position of the function on the curve of learning. However, it is
impossible to state definitely the influence of this factor, since
experiments give different results.

G. S. Gates (following H. L. Hollingworth)³ gives the following
reliabilities between successive repetitions. Each reliability coefficient
is the average of five functions (color-naming, tapping, adding,
multiplying and word-building):

Repetitions ......... 1-2    2-3    4-5    7-8
Reliability ......... .62    .78    .85    .85

Gundlach⁴ found the following reliabilities, each being the average of
three functions (number series, cancellation and multiplication):

Repetitions ......... ?      ?      11-12  16-17  21-22  24-25
Reliability ......... .81    .16(?) .88    .86    .83    .84





¹ Woodyard, E.: “The Effect of Time upon Variability.” Teachers College
Contributions to Education No. 216, 1926.

² Hines, H. C.: “The Honor System and the Normal Curve.” School and
Society, Vol. XXVI, Oct. 15, 1927, pp. 481-486.

³ Gates, G. S.: “Individual Differences as Affected by Practice.” Archives
of Psychology No. 58, 1922.

Hollingworth, H. L.: “Correlation of Abilities as Affected by Practice.”
Journal of Educational Psychology, Vol. IV, 1913.

⁴ Gundlach, R.: “The Effect of Practice on the Correlations of Three Mental
Traits.” Journal of Educational Psychology, Vol. XVII, Sept., 1926, pp. 387-401.
Slocombe¹ reports the following reliabilities for eight performances of
analogies:

Performances ........ 1-2    2-3    3-4    4-5    5-6    6-7    7-8
Reliability ......... .86    .85    .80    .68    .72    .70    .84

No trend can be recognized which is common to these three sets of
results. Common experience usually ascribes a certain consistency
of performance to those who have had considerable practice in a
function. Experts are usually counted upon to “come through”
consistently, in contrast with the novice, who may have beginner’s luck.
Yet these same superficial observations fail to note whether the expert
has gained in efficiency over his previous condition, or even whether
the expert may not show the same amount of measured variability
that the novice shows. At present the available evidence leads us to
believe that the position of a function on the curve of learning has little
relation to the reliability or consistency of that function.
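One way to make “no common trend” concrete is to fit a least-squares slope to each series, using the first trial of each correlated pair as the measure of practice. The sketch below does this for the Gates and Slocombe figures quoted above; the Gundlach series is left out because two of its column headings are illegible in this copy. Using a linear slope as the measure of trend is an assumption of this example, not the author’s procedure.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# First trial of each correlated pair, and the reliability reported for that pair.
gates = ([1, 2, 4, 7], [0.62, 0.78, 0.85, 0.85])
slocombe = ([1, 2, 3, 4, 5, 6, 7], [0.86, 0.85, 0.80, 0.68, 0.72, 0.70, 0.84])

for name, (xs, ys) in [("Gates", gates), ("Slocombe", slocombe)]:
    print(f"{name}: slope per trial = {slope(xs, ys):+.4f}")
```

The Gates series rises with practice while the Slocombe series falls, which is consistent with the author’s reading that the two do not share a trend.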





¹ Slocombe, C. S.: “The Constancy of ‘g,’ General Intelligence.” British
Journal of Psychology, General Section, Vol. XVII, Oct., 1926, pp. 93-110.
THE MEASUREMENT OF BRIGHTNESS
EDWARD E. CURETON 
University of Oregon 


The purpose of this paper is to examine the psychological and 
mathematical difficulties inherent in the idea of the constancy of the 
IQ, and to derive a measure of brightness comparable to the IQ, but 
free from these difficulties. 

The mental abilities of different individuals increase with age, on
the whole, at rates proportional to the relative degrees of brightness of
these individuals, reaching their maximum at approximately the same
age.¹ ² ³ The only important exceptions are psychopathic cases, and
perhaps some of the lowest grades of intelligence (idiots and imbeciles).
Thus if an individual at any given age possesses two-thirds the absolute
mental ability of the average person of his age, this ratio will hold good
(aside from unsystematic and minor fluctuations) for any other age.
If we let $y = f(x)$ represent the equation of the average mental growth
curve, the equation of the growth curve of any individual may be
written $y' = c\,f(x)$, $c$ being a constant descriptive of the degree of
brightness of the individual (see Fig. 1). Every ordinate of the growth
curve of the individual will bear this fixed ratio $c$ to the corresponding
ordinate of the average mental growth curve. We shall hereafter call
this the intelligence ratio (IR).

Fig. 1. Mental growth curves: Y = absolute mental ability; X = chronological age.

The IQ is ordinarily defined as the ratio of mental age to chrono-
logical age. The idea of the constancy of the IQ would then indicate
that if an individual at any given age has an absolute mental ability
equal to that of the average individual two-thirds his age, then this
ratio would hold good for any other age. In mathematical terms,
letting $y = f(x)$ represent the equation of the average mental growth
curve as before, the equation of the mental growth curve of an individ-
ual becomes $y' = f(kx)$, $k$ being the IQ. Every abscissa of the individ-
ual curve will bear this constant ratio to the corresponding abscissa
of the average curve. While this condition is manifestly impossible
at the upper levels, it has been determined experimentally that at
least up to 10 or 12 years of age, the IQ does not vary appreciably in
any systematic manner.⁴ The reason for this condition is that up to
10 or 12 years of age, mental growth curves are not significantly differ-
ent from straight lines; and the only set of curves that will satisfy simul-
taneously the two conditions

$$\frac{y'}{y} = c \quad \text{(constancy of the IR)} \qquad \text{and} \qquad \frac{x(y')}{x(y)} = k \quad \text{(constancy of the IQ)},$$

where $x(y)$ denotes the age at which the average curve reaches ability $y$,
is a set of straight lines passing through the origin ($x = 0$, $y = 0$) (see Fig. 1).
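The argument can be checked numerically. The sketch below assumes two hypothetical average growth curves: a straight line through the origin, $f(x) = 10x$, and a curve that flattens with age, $f(x) = 429(1 - e^{-x/6.675})$. For an individual whose every ordinate is $c$ times the average ($y' = c\,f(x)$), it finds the mental age by inverting the average curve and divides by chronological age.

```python
import math

# Two hypothetical average growth curves y = f(x), each paired with its inverse x(y).
LINEAR = (lambda x: 10.0 * x, lambda y: y / 10.0)
CURVED = (lambda x: 429.0 * (1 - math.exp(-x / 6.675)),
          lambda y: -6.675 * math.log(1 - y / 429.0))

def iq_by_age(curve, c, ages):
    """IQ (mental age / chronological age) of an individual whose ability is y' = c f(x)."""
    f, x_of = curve
    # Mental age = age at which the average curve reaches the individual's ability.
    return [x_of(c * f(x)) / x for x in ages]

ages = [6, 10, 14]
print("straight line:", [round(q, 3) for q in iq_by_age(LINEAR, 0.8, ages)])
print("curved:       ", [round(q, 3) for q in iq_by_age(CURVED, 0.8, ages)])
```

With an IR of 0.8 the straight line yields a quotient of 0.8 at every age, while the curved growth yields quotients that fall with age (roughly 0.72, 0.65 and 0.58), so the IQ and IR part company as soon as growth departs from a straight line through the origin.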
Heinis⁵ cites evidence to show that the equation of the mental
growth curve corresponds quite closely, at least up to about age 30,
to the equation $y = b\left(1 - e^{-x/d}\right)$, $b$ and $d$ being constants. For the
average mental growth curve, he gives the particular value
$y = 429\left(1 - e^{-x/6.675}\right)$. He also states that the only value which changes
from one individual to another is $b$, whence we have

$$\frac{1 - e^{-z/6.675}}{1 - e^{-x'/6.675}} = c,$$

$z$ being mental age, $x'$ chronological age, and $c$ a “personal coefficient,”
which is a constant measure of brightness. As defined, $c$ is thus one
measure of the IR, but is not, as presented, directly comparable
(numerically) to the IQ, and requires extended computation or the use
of special tables involving interpolation to determine. The curve
$y = 429\left(1 - e^{-x/6.675}\right)$ is nowhere a straight line, but its deviation from
straightness up to age 9 or 10 is small compared to the probable error
of an individual score on the most reliable of our present mental tests.
The equation as presented was derived from the results of a group test
given in France. The test is not described in the paper, and no evi-
dence is given to show that this equation would hold for other tests.
Hence the writer believes it safer for present purposes to hold only to
the original assumptions regarding the form of the mental growth
curve: namely, that it is approximately a straight line for the first few
years of life and later becomes convex, the relative rate of change and
the time of final cessation of growth possibly depending on the
nature of the function measured and being relatively independent of the
degree of brightness of the individual.
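Heinis’s personal coefficient, in the form described in the paragraph above, can be computed directly from a mental age $z$ and a chronological age $x'$ (both in years). This is a minimal sketch that assumes the constant 6.675 quoted for the average curve; the sample child is hypothetical.

```python
import math

D = 6.675  # constant of the average curve y = 429(1 - e^(-x/6.675)) quoted from Heinis

def personal_coefficient(mental_age, chrono_age, d=D):
    """Heinis-style personal coefficient c = (1 - e^(-z/d)) / (1 - e^(-x'/d))."""
    return (1 - math.exp(-mental_age / d)) / (1 - math.exp(-chrono_age / d))

# Hypothetical child: chronological age 8, mental age 10 (IQ 1.25).
c = personal_coefficient(10, 8)
print(f"IQ = {10 / 8:.3f}, personal coefficient c = {c:.3f}")
```

For this child the coefficient comes out near 1.11 against an IQ of 1.25, which illustrates the author’s point that $c$ is not numerically comparable to the IQ.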

The best measure of brightness, under the above assumptions, 
would be the IR, determined from an absolute mental scale (one on 
which the zero point represents just no mental ability whatever, and 
on which all score units represent equal increments of absolute mental 
ability). Unfortunately, no such scale has ever been developed. 
There are two reasons for this. In the first place, we do not know just 
what constitutes absolute mental ability; and in the second, we have no
adequate means for making reliable and valid mental measurements 
in the very low ranges. 

While we have hitherto been unable to measure absolute mental
ability directly, we have been able to get at it indirectly through the
knowledge that its development is a function of the maturity of the
individual, which can be measured on an absolute scale: namely,
chronological age. Hence by reference to the growth curve, any
ordinate (absolute mental ability) may be made to determine uniquely
a particular abscissa (the mental age), and ratio comparisons of the
ordinates thus become possible through reference to the absolute age
scale, even though the actual absolute mental abilities cannot be
determined. It is this fact which has given rise to the IQ technique.

As long as the function connecting absolute mental ability with age 
is linear, the IQ will be constant; but as soon as it deviates appreciably 
from linearity, the IQ, though still a function of the absolute mental 
ability, is no longer a constant, and corrections must be devised to 
obtain a measure analogous to the IQ but remaining constant. If the 
function (the mental growth curve), in addition to remaining linear 
throughout the first few years of life, passes through the origin of the 
absolute mental ability scale and the age scale; i.e., if an individual 
may be assumed to possess zero mental ability at zero age; then the 
IQ as computed by reference to this curve will not only be a linear 
function of the IR, but the two will be identical. This condition is 
closely approximated in practice in the lower age ranges, there being 
only two minor deviations. The first is the fact that the growth curve
is not, in all probability, exactly linear even at the lower ages, but 
tends to be slightly convex. The second arises from the fact that the 
zero point on the chronological age scale is placed at birth rather than 
at some earlier point (possibly as far back even as conception) where 
absolute mental ability is actually zero. Neither of these deviations 
gives rise to errors of measurement of magnitudes comparable to the 
probable errors of scores on even our most reliable present-day tests. 

When we develop reliable tests for the very lowest age groups (from 
about 3 years down), it may be necessary to determine corrections for 
the zero-point of the chronological age scale. In the middle and upper 
ranges, such corrections are entirely unnecessary. 

In the upper age ranges (from about 12 years on), the mental 
growth curve becomes noticeably different from a straight line, and 
corrections must be applied to counteract this discrepancy if the 
mental age is to be used as the measure of mental level. Methods 
must also be devised for handling test scores higher than those which 
are average for any age. Since the IQ is determined by two factors, 
the chronological age and the mental age, there will need to be two 
corresponding corrections. 

The first correction (for chronological age) consists in dividing 
the mental age not by the actual chronological age nor by any arbitrary 
age (as 16), but by the average mental age of unselected persons of 
the given chronological age. The practice of dividing by the actual 
chronological age up to some stated time, such as 16, assumes that 
the mental growth curve is a straight line up to this age, and is per- 
fectly horizontal thereafter. The discrepancy between such a theo- 
retical curve and the true curve is considerable. In the first place, 
mental growth begins to slow down some time before age 16, which 
causes the IQ from 10 or 12 to 16 to become progressively lower, since 
chronological age is growing at the same rate as before, while mental 
growth is slowing down. After 16, the IQ begins to increase again,
due to the fact that mental growth still continues (albeit slowly⁶) while
the “chronological age” used in the divisor is held constant. Whether
the final adult IQ is higher or lower than that of early childhood
depends upon the nature of the mental function measured. In the
case of Army Alpha, results show that it is probably lower. In any
case, all adolescent IQ’s computed by this method will be too low.
Dividing by the mental age equivalent to the actual chronological age
will eliminate this difficulty.
The second correction (for mental age) consists in using not the
actual age norms, but a set of artificial age equivalents obtained by 
referring the test scores to the age scale by means of a straight line 
passing through the origin. This line could be drawn at any angle 
(except parallel to one of the axes) without affecting the value of 
any IQ computed by it. However, if the concept of the mental age 
is to be retained in anything like its original significance, the line should 
be so drawn as to give the best possible fit to that portion of the growth 
curve (from about ages 6 or 7 to 10 or 12) which is known to be approxi- 
mately straight. 
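Taken together, the two corrections make the quotient independent of the slope chosen for the reference line, as the text asserts. The sketch below assumes a hypothetical growth curve in absolute units that is straight up to 12 and flattens afterwards, and a reference line through the origin with slope `a`. Mental ages for both the individual and the average person of the same chronological age are read from that line.

```python
import math

def growth(x):
    """Hypothetical average ability (absolute units) at age x: linear early, flattening later."""
    return 20.0 * x if x <= 12 else 240.0 + 60.0 * (1 - math.exp(-(x - 12) / 3))

def line_age(score, a):
    """Age equivalent of a score read from the straight line score = a * age through the origin."""
    return score / a

def corrected_iq(c, x, a):
    individual_ma = line_age(c * growth(x), a)  # second correction: line-based mental age
    average_ma = line_age(growth(x), a)         # first correction: average MA at this age
    return individual_ma / average_ma

for a in (10.0, 20.0, 35.0):
    print(a, [round(corrected_iq(1.2, x, a), 6) for x in (8, 12, 15, 18)])
```

For an individual with IR 1.2 every line prints 1.2 at every age, whatever the slope, because the slope cancels in the ratio.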

If transformations made by reference to this straight line are to 
have uniform significance, and particularly if the line is to be extended 
above the range of scores which are average for any age to give extra- 
polated mental ages, the test scores must be so expressed as to repre- 
sent equal increments of absolute mental ability. This involves the 
two difficulties noted previously: The defining of absolute mental 
ability and the location of a zero-point. 

Thorndike⁶ has shown that the only useful definition of mental
ability is in terms of the specific tasks which it can accomplish. Using 
his terminology, therefore, we may define absolute mental ability as 
the altitude of intellect necessary to master tasks of a given nature 
(such as those found in some designated test), of any known degree 
of difficulty. He has also demonstrated that the total score on any 
ordinary mental test in which the problems are arranged in order from 
easy to hard, and which is not affected to any great degree by the time 
element, is a measure of altitude of intellect of the sort measured by 
the test. He has further shown that the distribution of altitude of 
intellect in any grade group approaches quite closely the normal 
curve, whence the intellectual difficulty corresponding to any given 
test score may be determined from the percentages of various grade 
groups who attain the given score, by reference to a table of the 
normal probability integral. By means of appropriate transforma- 
tions, the test scores may then be expressed in terms of a scale whose 
units actually represent equal increments of intellectual difficulty, 
and may therefore be presumed to represent equal increments of 
absolute mental ability (defined as above). The process by which 
these transformations are made is quite long, and will not be repeated 
here. Thorndike has described it quite fully (reference 6, Chapter 
7), and has derived equal-unit scales for a number of common group 
mental tests. 
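The central step of this scaling, converting the share of a group that attains a score into a height on a normal scale, can be sketched with the inverse of the normal probability integral. This covers only that single step, not Thorndike’s full procedure, which combines several grade groups onto one scale; the percentages below are hypothetical.

```python
from statistics import NormalDist

def difficulty(share_attaining):
    """Height of a score above the group mean, in standard deviations of the group,
    assuming altitude of intellect is normally distributed within the group."""
    return NormalDist().inv_cdf(1 - share_attaining)

# Hypothetical percentages of one grade group attaining each raw score.
for score, share in [(40, 0.84), (60, 0.50), (80, 0.16)]:
    print(score, round(difficulty(share), 2))
```

A score reached by half the group sits at the group mean, and scores reached by 84 and 16 per cent sit about one standard deviation below and above it.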
To locate the zero-point, we may plot the age norms on the scale
of equal units against the given chronological ages, draw a smooth
curve through these points, determine by inspection the straight por-
tion of this curve, and extend it back to zero chronological age. This
procedure shows how far the zero-point of the equal-unit scale is above
the true zero of mental ability, assuming zero mental ability at zero
chronological age. The straight portion of the curve may also be
extended upward to the top of the scale of equal units to give
extrapolated mental ages. IQ’s determined by reference to this line
will be exactly equal to the IR’s which could now be computed by the
use of the zero-point determined as above, and will be strictly com-
parable (numerically) to those obtained by the usual procedures, the
two being in fact identical over the straight portion of the curve.
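The zero-point step can be sketched numerically with the Table II norms. Where the author fits the straight portion by inspection, this sketch substitutes an ordinary least-squares line through the 8-, 9- and 10-year norms (19.1, 38.6, 54.9); the choice of those three points is an assumption, so the result need not match Fig. 2.

```python
# Equal-unit age norms at 8, 9 and 10 years, from Table II.
ages = [8.0, 9.0, 10.0]
norms = [19.1, 38.6, 54.9]

n = len(ages)
mean_age, mean_norm = sum(ages) / n, sum(norms) / n
slope = (sum((x - mean_age) * (y - mean_norm) for x, y in zip(ages, norms))
         / sum((x - mean_age) ** 2 for x in ages))
intercept = mean_norm - slope * mean_age  # value of the line at zero chronological age

def extrapolated_ma(score):
    """Age at which the fitted straight line reaches an equal-unit score."""
    return (score - intercept) / slope

print(f"slope {slope:.2f} units/year; at age 0 the line stands at {intercept:.1f}")
print(f"extrapolated MA for the top score 208.3: {extrapolated_ma(208.3):.1f} years")
```

The least-squares line rises 17.9 units a year and stands near -123.6 at age 0, so the zero of the equal-unit scale would lie about 124 units above the true zero. The author’s line, z = .05333y + 7, corresponds to about 18.75 units a year and about -131 at age 0; the difference shows how much the choice of the straight portion matters.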

When we are able to measure intelligence in the lower ranges accur- 
ately enough to be able to detect significant differences between the 
growth curve and a straight line, this procedure will need some modi- 
fication. We will then fit a curve to the equal-unit age norms (up to 
10 or 12), extrapolate this curve back to zero (or perhaps −0.75) chronological age, and then draw such a straight line through this origin as
will give the best fit (using the least squares or other suitable criterion*) 
to this section of the growth curve. 
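Once an origin $(x_0, y_0)$ has been fixed, the least-squares line through it has slope $\sum (x - x_0)(y - y_0) / \sum (x - x_0)^2$. A sketch, assuming for illustration the origin implied by the line z = .05333y + 7 used for Table I (an equal-unit value of about -131.26 at birth) and the Table II norms for ages 8 to 12; an actual application would first fit the curve the text describes.

```python
def slope_through_point(xs, ys, x0, y0):
    """Least-squares slope of the line y - y0 = b (x - x0), constrained to pass through (x0, y0)."""
    num = sum((x - x0) * (y - y0) for x, y in zip(xs, ys))
    den = sum((x - x0) ** 2 for x in xs)
    return num / den

ages = [8.0, 9.0, 10.0, 11.0, 12.0]
norms = [19.1, 38.6, 54.9, 68.6, 80.6]  # Table II, equal-unit scale
origin = (0.0, -7.0 / 0.05333)          # assumed: where z = .05333y + 7 meets zero age

b = slope_through_point(ages, norms, *origin)
print(f"best-fitting slope through the assumed origin: {b:.2f} units per year")
```

Setting `x0` to -0.75 instead of 0 would place the origin at conception, as the text suggests might eventually be necessary.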

The application of the principles above outlined is best shown by 
an example. The procedure will therefore be worked out for the Otis 
Advanced Examination. Before going on to this example, it may be 
well to review again briefly the assumptions on which the procedure 
is based. These are: 

1. Equal increments of difficulty of intellectual tasks of a given 
nature measure equal increments of the absolute mental ability 
necessary to perform such tasks. 

2. The mental growth curve, up to 10 or 12, is approximately a 
straight line. 

3. An individual possesses zero mental ability at zero age. For 
most purposes, as in the present example, zero age may be taken as at 
birth. 

The procedure is as follows: 

1. Express the test scores in terms of a scale of equal units (see
Table I, columns 1 and 2).





* The other suitable criterion mentioned might be to draw the line so as to be 
parallel to the growth curve at some specified point (as at 12 years of age) where 
the IQ might arbitrarily be desired to be a “true” IQ.
TABLE I. Equal-unit and Mental Age Equivalents of Scores, Otis Advanced Examination
Equal-unit equivalents are taken from Thorndike,⁶ Tables LXXIII and LXXIIIa

Score  Equal-unit  MA    | Score  Equal-unit  MA    | Score  Equal-unit  MA    | Score  Equal-unit  MA
  10     -6.7      6-8   |   58     58.1     10-1   |  106    105.7     12-8   |  154    152.0     15-1
  11     -4.8      6-9   |   59     59.2     10-2   |  107    106.7     12-8   |  155    152.9     15-2
  12     -3.0      6-10  |   60     60.2     10-3   |  108    107.6     12-9   |  156    154.0     15-3
  13     -1.1      6-11  |   61     61.3     10-3   |  109    108.6     12-10  |  157    155.0     15-3
  14      0.8      7-1   |   62     62.3     10-4   |  110    109.5     12-10  |  158    156.1     15-4
  15      2.6      7-2   |   63     63.4     10-5   |  111    110.5     12-11  |  159    157.2     15-5
  16      4.4      7-3   |   64     64.4     10-5   |  112    111.4     12-11  |  160    158.3     15-5
  17      6.2      7-4   |   65     65.5     10-6   |  113    112.4     13-0   |  161    159.4     15-6
  18      8.0      7-5   |   66     66.5     10-7   |  114    113.3     13-1   |  162    160.5     15-7
  19      9.8      7-6   |   67     67.5     10-7   |  115    114.3     13-1   |  163    161.6     15-7
  20     11.6      7-7   |   68     68.6     10-8   |  116    115.2     13-2   |  164    162.7     15-8
  21     13.1      7-8   |   69     69.6     10-8   |  117    116.2     13-2   |  165    163.8     15-9
  22     14.6      7-9   |   70     70.6     10-9   |  118    117.1     13-3   |  166    164.9     15-10
  23     16.1      7-10  |   71     71.6     10-10  |  119    118.1     13-4   |  167    166.0     15-10
  24     17.6      7-11  |   72     72.6     10-10  |  120    119.0     13-4   |  168    167.1     15-11
  25     19.1      8-0   |   73     73.6     10-11  |  121    120.0     13-5   |  169    168.2     16-0
  26     20.6      8-1   |   74     74.6     11-0   |  122    120.9     13-5   |  170    169.3     16-0
  27     22.1      8-2   |   75     75.6     11-0   |  123    121.9     13-6   |  171    170.4     16-1
  28     23.6      8-3   |   76     76.6     11-1   |  124    122.8     13-7   |  172    171.5     16-2
  29     25.0      8-4   |   77     77.6     11-2   |  125    123.8     13-7   |  173    172.6     16-2
  30     26.4      8-5   |   78     78.6     11-2   |  126    124.7     13-8   |  174    173.7     16-3
  31     27.7      8-6   |   79     79.6     11-3   |  127    125.7     13-8   |  175    174.8     16-4
  32     28.9      8-6   |   80     80.6     11-4   |  128    126.6     13-9   |  176    175.9     16-5
  33     30.2      8-7   |   81     81.6     11-4   |  129    127.6     13-10  |  177    177.0     16-5
  34     31.4      8-8   |   82     82.5     11-5   |  130    128.6     13-10  |  178    178.1     16-6
  35     32.6      8-9   |   83     83.5     11-5   |  131    129.5     13-11  |  179    179.3     16-7
  36     33.8      8-10  |   84     84.5     11-6   |  132    130.5     14-0   |  180    180.5     16-8
  37     35.0      8-10  |   85     85.4     11-7   |  133    131.4     14-0   |  181    181.6     16-8
  38     36.2      8-11  |   86     86.4     11-7   |  134    132.4     14-1   |  182    182.8     16-9
  39     37.4      9-0   |   87     87.4     11-8   |  135    133.4     14-1   |  183    183.9     16-10
  40     38.6      9-1   |   88     88.3     11-9   |  136    134.3     14-2   |  184    185.1     16-10
  41     39.7      9-1   |   89     89.3     11-9   |  137    135.3     14-3   |  185    186.3     16-11
  42     40.8      9-2   |   90     90.3     11-10  |  138    136.2     14-3   |  186    187.5     17-0
  43     41.9      9-3   |   91     91.3     11-10  |  139    137.2     14-4   |  187    188.7     17-1
  44     43.0      9-4   |   92     92.2     11-11  |  140    138.2     14-4   |  188    189.9     17-2
  45     44.1      9-4   |   93     93.2     12-0   |  141    139.1     14-5   |  189    191.1     17-2
  46     45.2      9-5   |   94     94.2     12-0   |  142    140.1     14-6   |  190    192.3     17-3
  47     46.3      9-6   |   95     95.1     12-1   |  143    141.0     14-6   |  191    193.6     17-4
  48     47.4      9-6   |   96     96.1     12-2   |  144    142.0     14-7   |  192    195.1     17-5
  49     48.4      9-7   |   97     97.1     12-2   |  145    143.0     14-8   |  193    196.6     17-6
  50     49.5      9-8   |   98     98.0     12-3   |  146    143.8     14-8   |  194    198.1     17-7
  51     50.6      9-8   |   99     99.0     12-3   |  147    144.9     14-9   |  195    199.7     17-8
  52     51.6      9-9   |  100    100.0     12-4   |  148    145.8     14-9   |  196    201.3     17-9
  53     52.7      9-10  |  101    101.0     12-5   |  149    146.8     14-10  |  197    203.0     17-10
  54     53.8      9-10  |  102    101.9     12-5   |  150    147.8     14-11  |  198    204.7     17-11
  55     54.9      9-11  |  103    102.9     12-6   |  151    148.9     14-11  |  199    206.5     18-0
  56     56.0      10-0  |  104    103.9     12-6   |  152    149.9     15-0   |  200    208.3     18-1
  57     57.0      10-0  |  105    104.8     12-7   |  153    151.0     15-1   |

Entries in the MA columns are obtained as follows: 
1. From Fig. 2, z = .05333y + 7 where z = age in years and y = equal-unit score. 
2. Setting z in terms of months, we have z = .64y + 84. 
3. Multiplying each entry in the equal unit column by .64, adding 84, and changing from 
months to years and months, we obtain the corresponding entry in the MA column. 
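The three steps above amount to a one-line conversion. A minimal sketch (the function name is hypothetical) that rounds to the nearest month; this rounding rule is inferred from the table and reproduces the printed MA column for the entries checked here.

```python
import math

def mental_age(equal_unit_score):
    """Convert an equal-unit score to a mental age 'years-months'
    using z = .64y + 84 (months), rounded half up to the nearest month."""
    months = math.floor(0.64 * equal_unit_score + 84 + 0.5)
    years, rem = divmod(months, 12)
    return f"{years}-{rem}"

print(mental_age(19.1), mental_age(100.0), mental_age(208.3))
```

This prints `8-0 12-4 18-1`, matching the entries for raw scores 25, 100 and 200. At least one printed entry (raw score 69, equal-unit 69.6, printed 10-8) comes out a month higher by this rule.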
2. Express the age norms in terms of the scale of equal units (see
Table II). It is well also to note in this table the limits of the
equal-unit scale as derived.


TABLE II. Equal-unit Equivalents of Age Norms, Otis Advanced Examination

Raw score norms are taken from the Otis Group Intelligence Scale, Manual of
Directions, 1921 Revision, World Book Co., page 71

Age      Norm, raw score    Norm, equal-unit score
               10*                 -6.7*
8-0            25                  19.1
9-0            40                  38.6
10-0           55                  54.9
11-0           68                  68.6
12-0           80                  80.6
13-0           90                  90.3
14-0          100                 100.0
15-0          110                 109.5
16-0          120                 119.0
17-0          127                 125.7
18-0          130                 128.6
19-0          130                 128.6
              200*                208.3*

* Limits of the equal-unit scale.


3. Plot the growth curve, taking the values of the equal-unit scale 
as ordinates and chronological ages as abscissae (see Fig. 2). 

4. Determine by inspection the straight portion of this curve (at 
the lower end), and extend this straight line up and down to include 
the limits of the equal-unit scale (see Fig. 2). Note that under the 
assumptions made, this line may for theoretical purposes be extended 
back to the zero of the age scale, at which point it will determine the 
absolute zero of the equal-unit scale (see Fig. 3). 

5. This straight line may now be used to determine the age corre- 
sponding to any equal-unit score. Write down the equation of this 
line. Referring to Table I, columns 1 and 2, determine by this equa- 
tion the mental age equivalents of the raw scores (see Table I, column 
3). 

6. Referring again to Fig. 2, find the age on the straight line corre- 
sponding in equal-unit score value to each age on the growth curve as 
plotted. Tabulate these chronological age equivalents (see Table III). 
TABLE III. Chronological Age Equivalents, Otis Advanced Examination

                                        Years
Months   8     9     10     11     12     13     14     15     16     17     Adult
  0     8-0   9-0   9-11   10-9   11-4   11-10  12-4   12-10  13-4   13-9   13-11
  1     8-1   9-1   10-0   10-10  11-4   11-10  12-4   12-10  13-4   13-9
  2     8-2   9-2   10-1   10-11  11-5   11-11  12-5   12-11  13-5   13-9
  3     8-3   9-3   10-2   10-11  11-5   11-11  12-5   12-11  13-5   13-10
  4     8-4   9-4   10-3   11-0   11-6   12-0   12-6   13-0   13-6   13-10
  5     8-5   9-5   10-4   11-0   11-6   12-0   12-6   13-0   13-6   13-10
  6     8-6   9-6   10-4   11-1   11-7   12-1   12-7   13-1   13-7   13-10
  7     8-7   9-7   10-5   11-1   11-7   12-1   12-7   13-1   13-7   13-10
  8     8-8   9-8   10-6   11-2   11-8   12-2   12-8   13-2   13-7   13-11
  9     8-9   9-9   10-7   11-2   11-8   12-2   12-8   13-2   13-8   13-11
 10     8-10  9-10  10-8   11-3   11-9   12-3   12-9   13-3   13-8   13-11
 11     8-11  9-11  10-9   11-3   11-9   12-3   12-9   13-3   13-9   13-11

Entries in this table are obtained by reference to Fig. 2. 
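Table III can be approximated without the graph by interpolating linearly between the yearly equal-unit norms of Table II and then reading the age off the straight line z = .64y + 84 months. This is a sketch under that assumption: the author read the entries from a smooth curve in Fig. 2, so interpolated values may differ from the printed ones by a month here and there.

```python
import math

# Equal-unit age norms from Table II (age in years -> equal-unit score).
NORMS = {8: 19.1, 9: 38.6, 10: 54.9, 11: 68.6, 12: 80.6, 13: 90.3,
         14: 100.0, 15: 109.5, 16: 119.0, 17: 125.7, 18: 128.6}

def norm_at(age_months):
    """Linearly interpolated equal-unit norm for a chronological age in months (valid from 8-0)."""
    years, rem = divmod(age_months, 12)
    if years >= 18:
        return NORMS[18]
    lo, hi = NORMS[years], NORMS[years + 1]
    return lo + (hi - lo) * rem / 12

def ca_equivalent(years, months):
    """Chronological age equivalent 'years-months' read from the straight line z = .64y + 84."""
    line_months = math.floor(0.64 * norm_at(12 * years + months) + 84 + 0.5)
    y, m = divmod(line_months, 12)
    return f"{y}-{m}"

print(ca_equivalent(12, 0), ca_equivalent(14, 0), ca_equivalent(16, 0))
```

At whole years this reproduces most of the printed row for 0 months (for example 11-4, 12-4 and 13-4 at 12, 14 and 16).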
Given now an individual’s age and his raw score, his IQ will be
determined by dividing his mental age equivalent as determined from 
Table I, by his chronological age equivalent as determined from 
Table III. 
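The final division can be sketched with a few entries copied from Tables I and III. The pupil (age 14-0, raw score 120) is hypothetical, and a working version would hold the complete tables rather than these excerpts.

```python
# Excerpts from Table I (raw score -> MA) and Table III (age -> CA equivalent).
MA_FROM_RAW = {100: "12-4", 120: "13-4", 140: "14-4"}
CA_EQUIVALENT = {"10-0": "9-11", "12-0": "11-4", "14-0": "12-4"}

def to_months(age):
    """Convert a 'years-months' string to months."""
    years, months = age.split("-")
    return 12 * int(years) + int(months)

def iq(raw_score, age):
    """IQ = mental age equivalent / chronological age equivalent, times 100, rounded."""
    ma = to_months(MA_FROM_RAW[raw_score])
    ca = to_months(CA_EQUIVALENT[age])
    return round(100 * ma / ca)

print(iq(120, "14-0"))
```

The hypothetical pupil gets 160 / 148, or an IQ of 108, while a pupil of 14-0 making the 14-year norm score (raw 100) gets exactly 100.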

Insofar as the three assumptions on which this procedure is based 
are valid, and within the chance errors of the test scores, the age norms 
and the equal-unit scale equivalents, the IQ thus found will have the 
same significance, in both the psychological and numerical senses, as 
one determined by the usual methods, and will have the same signifi- 
cance for adolescents and adults as for children. 

Wilson® has suggested that since some mental functions increase 
more rapidly with age than others, the standard score being used as 
the unit of measurement in each case, and since the real variability of 
individuals is not a linear function of chronological age, the standard 
score itself should be used as the measure of brightness. Aside from 
the cumbersomeness in use of a table of standard scores for every 
month of chronological age, the writer believes that the IR, defined as 
the ratio of absolute mental ability to average absolute mental ability 
and measured by the IQ as derived above, is a sounder measure of 
brightness than is the standard score, for the very reasons given by 
Wilson in support of the standard score. 
A QUANTITATIVE SCALE FOR RATING THE HOME
AND SOCIAL ENVIRONMENT OF MIDDLE CLASS 
FAMILIES IN AN URBAN COMMUNITY: A 
FIRST APPROXIMATION TO THE 
MEASUREMENT OF SOCIO-ECONOMIC 
STATUS* 


F. STUART CHAPIN 


University of Minnesota 


The measurement of socio-economic status is a problem of considerable
interest.† To the student of human behavior it is important
whenever research requires a quantitative measure of the social 
environment in order that this factor or group of factors may be 
equated in experimental study. To the social worker the measure- 
ment of socio-economic status is important whenever it is desired to 
secure an objective measure of the home environment for purposes of 
placing children in foster homes. 

The purposes of this article are to explain and make available: 
First, for research workers, a rough scale for rating the home and the 
social environment in order to permit the equating of these factors in 
experimental study; and second, for social workers, an objective 
measure of the home environment for prospective foster homes. 

Although we wish to measure socio-economic status, it is not easy
to agree on a definition that will be adequate for all purposes. Conse-
quently the following definition is arbitrarily made and offered for
what it is worth, as an assumption from which to start in an attempt
at measurement.

Socio-economic status is the position that an individual or a family
occupies with reference to the prevailing average standards of cultural
possessions, effective income, material possessions, and participation in
group activity of the community. In this definition we arbitrarily
assume, for purposes of making a start in the study of this problem,
that family life contains the four objective and measurable elements
just enumerated.


† From the Department of Sociology and the Institute of Child Welfare,
University of Minnesota.

* See bibliography at end of paper referred to in footnote by numbers. 

The author is indebted to the following graduate students for assistance in this 
study: Mrs. M. K. Doyle, Mrs. Anne F. Fenlason, Harold Hosea, Mildred Parten, 
Ruth R. Pearson, Madga Skalet, Marjorie Walker and Sanford Winston. 
In the measurement of social phenomena we may resort to any of
the following devices as means of getting our observations in quan- 
titative form. 

1. Applying to a new problem a quantitative scale devised or per- 
fected in other investigations. 

2. Constructing a scale of ratings set by the consensus of opinions 
of experts from whom independent judgments are procured. 

3. Constructing a scale by arbitrarily weighting the data. 

In this study each of these three methods was used in an effort to 
measure the four elements assumed to constitute socio-economic status. 
Let us now consider each of these elements and the scale constructed to 
measure it. 

Cultural possessions were defined to include books, newspapers, 
periodicals, telephone, radio, musical instruments, sheet music, phono- 
graphic records, phonograph, etc. Presence of each element in the 
home was given an arbitrary weight. The final culture score for any 
family was the sum total of these weights. The highest culture
score of any family was 492; the lowest was 2. The
weights were independently assigned by two persons familiar with the 
records. The results of these two scoring efforts were compared, found 
to be quite consistent, and harmonized where needed. The resulting 
scale was used to score the families on the basis of very careful and 
complete records obtained by home visits and interviews by an experi- 
enced social case worker. 
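The culture score is a weighted count of possessions present in the home. The weights below are hypothetical, chosen only to show the arithmetic; the study’s actual weights are not reproduced here.

```python
# Hypothetical weights; not the weights actually assigned in the study.
CULTURE_WEIGHTS = {"daily newspaper": 4, "periodical": 3, "telephone": 5,
                   "radio": 6, "piano": 10, "phonograph": 4}

def culture_score(items_present):
    """Sum the weight of every cultural possession found in the home."""
    return sum(CULTURE_WEIGHTS[item] for item in items_present)

family = {"daily newspaper", "periodical", "telephone", "piano"}
print(culture_score(family))  # 22 with these made-up weights
```

An item missing from the weight table raises a `KeyError`, which surfaces possessions the schedule did not anticipate instead of silently scoring them zero.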

Effective income is the number of dollars in the net income per 
ammain. The ammain (abbreviation for adult male maintenance) is 
the ‘gross demand for articles of consumption having a total money 
value equal to that demanded by the average male of that class at the 
age when his total requirements for expense of maintenance reach a 
maximum.”’!® The scale is that of Sydenstricker and King based 
upon a study of complete budgets of 140 families and from food records 
of 1500 families collected in 1917 from residents of 20 cotton mill vil- 
lages of South Carolina. 
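Effective income divides the family’s net income by its total requirement in ammains. The scale factors below are placeholders, not the Sydenstricker and King values: the adult male at maximum requirement counts 1.0 by definition, and the other figures are invented for illustration.

```python
# Placeholder ammain factors: 1.0 for the adult male at maximum requirement (by definition);
# the remaining values are hypothetical, not the Sydenstricker and King scale.
AMMAIN = {"adult_male": 1.0, "adult_female": 0.8, "child_under_6": 0.4}

def effective_income(net_income, members):
    """Dollars of net income per ammain for a family given as a list of member categories."""
    ammains = sum(AMMAIN[m] for m in members)
    return net_income / ammains

print(round(effective_income(2200, ["adult_male", "adult_female", "child_under_6"]), 2))
```

With these placeholder factors the family counts as 2.2 ammains, so an income of $2,200 corresponds to an effective income of about $1,000 per ammain.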

Participation in group activity of the community was rated on the following basis. About 40 executives in the social agencies of the Twin Cities were asked to rank the following in order of importance as evidence of participation in group activity: membership, contributions, attendance, committee membership, and official positions in clubs, organizations and community activities. No distinction was made between one club or activity and another; each counted with the same weight. Replies to this questionnaire showed a clear majority in favor of the following order, beginning with the least important evidence of participation in group activities: (1) membership, (2) attendance, (3) contributions, (4) membership on committees and (5) position as an officer. This order was followed in assigning arbitrary weights, from 1 for membership to 5 for an office. The sum of the weights of the father and the mother in each family was taken as the score of that family for participation in group activity. For the sake of convenience this score will be called the group setting index of the family. For example, if the father belonged to 6 clubs or organizations, attended 5, contributed to 3, was a member of 2 committees and was an officer in one organization, his total score would be 6 × 1 + 5 × 2 + 3 × 3 + 2 × 4 + 1 × 5 = 38. The mother’s score would be similarly computed and the sum of the two scores would be the group setting index of the family. The scores ranged from 0 to as high as 119 for one family.
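The weighting can be sketched in a few lines. The weights 1 to 5 follow the rank order reported above and reproduce the father's score of 38; the dictionary keys and the mother's counts are hypothetical, chosen only for illustration.

```python
# Weights follow the reported rank order: membership 1, attendance 2,
# contributions 3, committee membership 4, office 5.
WEIGHTS = {
    "membership": 1,
    "attendance": 2,
    "contribution": 3,
    "committee": 4,
    "officer": 5,
}


def participation_score(counts):
    """Weighted sum of one parent's counts of each kind of participation."""
    return sum(WEIGHTS[kind] * n for kind, n in counts.items())


def group_setting_index(father, mother):
    """Family score: the father's and mother's scores added together."""
    return participation_score(father) + participation_score(mother)


# The father in the worked example.
father = {"membership": 6, "attendance": 5, "contribution": 3, "committee": 2, "officer": 1}
# A hypothetical mother, for illustration only.
mother = {"membership": 2, "attendance": 2}

print(participation_score(father))          # 38
print(group_setting_index(father, mother))  # 44
```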

Material possessions refer to household equipment. An elaborate 
schedule enumerating some 200 items and classes of items under main 
headings such as fixed features, built-in features, standard furniture, 
basic equipment, furnishings and family conveyances, was prepared 
and information under as many categories as applied was gathered 
from each family by a social worker. A scale of arbitrary weights 
was then prepared using the same procedure as outlined for the 
scoring of cultural possessions. Each family was then scored for 
its household equipment on the basis of the data collected on 
these schedules. 

The consequence of this procedure was that each family was scored on four different and independently derived scales. The gross results are shown in Table IA. Before interpreting these data, it should be stated that 38 families were studied in this way. These families maintained the homes from which children under five years of age came to attend the nursery school of the Institute of Child Welfare of the University of Minnesota. This is a research institute in child welfare supported by the Laura Spelman Rockefeller Memorial. Research studies are carried on there cooperatively by the following scientific departments of the University: anatomy, pediatrics, nervous and mental diseases, dietetics, educational psychology, psychology and sociology. The families are very cooperative in furthering the research, and represent a middle-class population following occupations that cross-section the occupational groups of Minneapolis, but with no representatives of the lowest laboring class and somewhat disproportionately weighted from the families of the professional and managerial classes (see Table II).*

TABLE IA. INSTITUTE OF CHILD WELFARE SCORES 1926-27

   (1)      (2)        (3)      (4)        (5)     (6)
Family  Culture  Effective    Group  Household  Living
          score     income  setting  equipment    room
                  (ammain)    index

     1      126        556        3        354      43
     2       21        612       15        293      55
     3       29       1194       44        321      46
     4        2        498       13        208      36
     5       61       1549       93        301      65
     6      123        988       24        372      64
     7      248       1940       27        424      75
     8        9        544       11        241      47
     9        3        506       18        273      36
    10      132       1099       58        342      53
    11       31        303        0        165      45
    12      308       3448       84        430      57
    13       48        452       17        113      20
    14        9        836       22        325      42
    15      104       1095       16        327      66
    16       99       1350       13        321      69
    17       39       1045       35        249      63
    18      236        882       28        284      48
    19       25       2510       80        504      59
    20      109        918       58        281      52
    21      267       1325       34        335      28
    22      179       1973      102        284      41
    23        4        763        0        144      27
    24       32       1727       28        258      52
    25       20       1009       12        223      33
    26      151       2000       88        394      54
    27       24       1086       12        286      60
    28       14        433       12        170      53
    29       79       1870       15        286      47
    30      434       1500       57        343      65
    31      128       1972       26        286      49
    32       63       1681       20        324      57
    33      159       1038       93        332      44
    34       27       1690        0        186      48
    35        8        908       26        251      42
    36       12        596       43        250      43
    37      212       1914      119        513      89
    38       61        925        8        175      53

TABLE IB. INSTITUTE OF CHILD WELFARE SCORES 1926-27

   (1)         (7)            (8)
Family      Holley   Chapman-Sims
            scores         scores

     1         ...             46
     2    3,032.89             44
     3    4,398.80             42

Let us now consider the interpretation of the four series of scores on the 38 families. The question at once arises: does a family that scores low on cultural possessions also score low on effective income, etc.? Does a family that scores high on the group setting index also score high on household equipment, etc.? In other words, are these four scales consistent in the measurement of what they are assumed to measure?


TABLE II. CHECK ON SAMPLING OF INSTITUTE OF CHILD WELFARE FAMILIES

Occupational   Per cent of      38 Institute of Child Welfare families
class          population of    Number      Per cent
               Minneapolis

I                   5.4           11          28.9
II                  6.3            7          18.4
III                37.3           11          28.9
IV                 24.3            5          13.1
V                  14.9            4          10.5
VI                 11.8            0           0
Total             100.00          38          99.8
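The sampling check in Table II amounts to comparing each class's share of the 38 families with its share of the Minneapolis population. A short sketch of that comparison follows, using only the figures in the table. It rounds to one decimal, so class IV comes out as 13.2 rather than the printed 13.1, which appears to have been truncated.

```python
# Figures from Table II: per cent of the Minneapolis population in each
# occupational class, and the number of the 38 Institute families in each class.
population_pct = {"I": 5.4, "II": 6.3, "III": 37.3, "IV": 24.3, "V": 14.9, "VI": 11.8}
family_counts = {"I": 11, "II": 7, "III": 11, "IV": 5, "V": 4, "VI": 0}

total = sum(family_counts.values())  # 38 families

# Each class's share of the sample, in per cent.
sample_pct = {c: 100 * n / total for c, n in family_counts.items()}

for c in population_pct:
    # A ratio above 1 means the class is over-represented among the 38 families.
    ratio = sample_pct[c] / population_pct[c]
    print(f"{c:>3}: city {population_pct[c]:5.1f}%  sample {sample_pct[c]:5.1f}%  ratio {ratio:4.2f}")
```

The ratios make the over-weighting of classes I and II, and the absence of class VI, easy to see at a glance.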














To test the data against these questions we have computed correlation coefficients, designating the variables as follows: $X_1$ = cultural possessions score, $X_2$ = effective income score, $X_3$ = group setting index, and $X_4$ = material possessions or household equipment score. The results of these computations are presented in Table III (column 2). It will be observed that the six correlation coefficients range from +.55 to +.68. The average of the six coefficients is +.62, and five of the six are +.61 or over. These coefficients are not as high as correlations sometimes found in social and psychological studies; on the other hand, they are substantial. The significant conclusion to draw is that four entirely different and independently derived scoring methods have been applied to the measurement of the same group of families and have given correlation coefficients which are significant in size and in substantial agreement with one another. May we conclude that these interpretative data supply evidence for believing that we have approximated the measurement of some common underlying uniformity such as socio-economic status? To answer the questions raised in the preceding paragraph, we may say that in general the household equipment of a family varies directly with the group setting index.

* Goodenough, F. L.: “The Kuhlmann-Binet Tests for Children of Pre-school Age: A Critical Study and Evaluation.” Institute of Child Welfare Monograph Series, No. 2, University of Minnesota Press, 1927.
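The six zero-order coefficients can in principle be recomputed from the gross scores of Table IA. The sketch below writes out the ordinary Pearson product-moment formula by hand. Because the table was transcribed from a scanned page and the text does not say exactly how the published coefficients were computed, the values it prints may not match Table III exactly.

```python
from itertools import combinations
from math import sqrt

# Columns (2)-(5) of Table IA, families 1-38 in order.
scores = {
    "X1 culture": [126, 21, 29, 2, 61, 123, 248, 9, 3, 132,
                   31, 308, 48, 9, 104, 99, 39, 236, 25, 109,
                   267, 179, 4, 32, 20, 151, 24, 14, 79, 434,
                   128, 63, 159, 27, 8, 12, 212, 61],
    "X2 income": [556, 612, 1194, 498, 1549, 988, 1940, 544, 506, 1099,
                  303, 3448, 452, 836, 1095, 1350, 1045, 882, 2510, 918,
                  1325, 1973, 763, 1727, 1009, 2000, 1086, 433, 1870, 1500,
                  1972, 1681, 1038, 1690, 908, 596, 1914, 925],
    "X3 group": [3, 15, 44, 13, 93, 24, 27, 11, 18, 58,
                 0, 84, 17, 22, 16, 13, 35, 28, 80, 58,
                 34, 102, 0, 28, 12, 88, 12, 12, 15, 57,
                 26, 20, 93, 0, 26, 43, 119, 8],
    "X4 household": [354, 293, 321, 208, 301, 372, 424, 241, 273, 342,
                     165, 430, 113, 325, 327, 321, 249, 284, 504, 281,
                     335, 284, 144, 258, 223, 394, 286, 170, 286, 343,
                     286, 324, 332, 186, 251, 250, 513, 175],
}


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# All six pairs of scales.
correlations = {}
for a, b in combinations(scores, 2):
    correlations[(a, b)] = pearson(scores[a], scores[b])
    print(f"r({a}, {b}) = {correlations[(a, b)]:+.2f}")
```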

But further analysis is necessary. What would happen if we studied the way in which cultural possessions varied with the group setting index when the relationship was undisturbed by either or both of the other two variables? To answer this question we have to resort to partial correlation. By this device we may measure the relationship between the positions of a family on any two of the scales with either one or both of the others held constant. The partial correlations are presented in Table III (columns 3 and 4). Analysis of this table shows that the relationships fall into three well-defined groups. First, when culture score and group setting index are correlated, or effective income and household equipment scores are correlated, with either or both of the other variables held constant, the correlation coefficients are not reduced to an insignificant size (see row A of Table III). Second, whenever culture and effective income, or group setting index and household equipment, are correlated, the partial coefficients drop away. Third, the remaining pairs of partial coefficients take an intermediate position. The conclusion indicated by this analysis is that measurements of culture and of group setting index are of some common underlying trait. Let us arbitrarily call this sociality, as the term is used by Allport.* Moreover, measurements of effective income and of household equipment are of some common underlying trait. Let us arbitrarily call this economic status. These conclusions seem not inconsistent with common-sense inference.
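The first- and second-order coefficients in Table III follow from the zero-order ones by the usual recursion, $r_{ij.k} = (r_{ij} - r_{ik}r_{jk}) / \sqrt{(1 - r_{ik}^2)(1 - r_{jk}^2)}$, applied one control variable at a time. The sketch below works from the rounded two-place zero-order values, so it agrees with the table only to about the second or third decimal.

```python
from math import sqrt

# Zero-order coefficients from Table III, column 2
# (1 = culture, 2 = effective income, 3 = group setting index, 4 = household equipment).
R = {
    (1, 2): 0.55, (1, 3): 0.68, (1, 4): 0.61,
    (2, 3): 0.64, (2, 4): 0.62, (3, 4): 0.63,
}


def r(i, j):
    """Symmetric lookup of a zero-order coefficient."""
    return 1.0 if i == j else R[(min(i, j), max(i, j))]


def partial(i, j, controls):
    """Partial correlation r_ij.controls, removing one control variable at a time."""
    if not controls:
        return r(i, j)
    k, rest = controls[-1], controls[:-1]
    r_ij = partial(i, j, rest)
    r_ik = partial(i, k, rest)
    r_jk = partial(j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


print(f"r12.3  = {partial(1, 2, (3,)):+.3f}")     # Table III gives +.204
print(f"r13.2  = {partial(1, 3, (2,)):+.3f}")     # Table III gives +.511
print(f"r12.34 = {partial(1, 2, (3, 4)):+.4f}")   # Table III gives +.1009
```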

If, for the moment, we grant tentatively that our study has made progress towards the measurement of socio-economic status as defined, we may now ask the question, “How much progress?” Further analysis of our data suggests partial answers to this last inquiry. In Table III (bottom row) are shown the multiple correlation coefficients $R_{1(3)}$, $R_{1(3,2)}$ and $R_{1(3,2,4)}$.

It will be noted that these coefficients increase in size from the initial one to the last as the successive variables are taken into consideration, so that the final multiple coefficient is .7224. This means that we have made some measurable progress towards recording the essential elements that characterize the phenomena studied.

* Allport, F. H.: “Social Psychology.” Pp. 103, 122-125.
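One standard way to build the multiple coefficients from the partials is $1 - R^2_{1(3,2,4)} = (1 - r_{13}^2)(1 - r_{12.3}^2)(1 - r_{14.23}^2)$, adding predictors in the order 3, 2, 4 as the subscripts suggest. The sketch below assumes that order and starts from the rounded zero-order values in Table III. It reproduces $R_{1(3)} = .68$ and comes within about .001 of the printed .6969 and .7224.

```python
from math import sqrt

# Zero-order coefficients from Table III, column 2 (same numbering as in the text).
R = {(1, 2): 0.55, (1, 3): 0.68, (1, 4): 0.61, (2, 3): 0.64, (2, 4): 0.62, (3, 4): 0.63}


def r(i, j):
    return 1.0 if i == j else R[(min(i, j), max(i, j))]


def partial(i, j, controls):
    """Partial correlation r_ij.controls (recursive formula)."""
    if not controls:
        return r(i, j)
    k, rest = controls[-1], controls[:-1]
    r_ij, r_ik, r_jk = partial(i, j, rest), partial(i, k, rest), partial(j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def multiple_r(y, predictors):
    """R_y(predictors): 1 - R^2 is the product of (1 - squared partials), each
    predictor partialled on the predictors listed before it."""
    unexplained = 1.0
    for idx, p in enumerate(predictors):
        unexplained *= 1 - partial(y, p, tuple(predictors[:idx])) ** 2
    return sqrt(1 - unexplained)


for preds in [(3,), (3, 2), (3, 2, 4)]:
    print(preds, round(multiple_r(1, preds), 4))
```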








TABLE III

(2) Zero-order       (3) First-order                   (4) Second-order
coefficients*        coefficients                      coefficients

r12 = +.55 ± .114    r12.3 = +.204    r12.4 = +.273    r12.34 = +.1009
r13 = +.68 ± .09     r13.2 = +.511    r13.4 = +.478    r13.24 = +.4163
r14 = +.61 ± .10     r14.2 = +.410    r14.3 = +.313    r14.23 = +.2622
r23 = +.64 ± .10     r23.1 = +.4847   r23.4 = +.403    r23.14 = +.3202
r24 = +.62 ± .099    r24.1 = +.439    r24.3 = +.369    r24.13 = +.38278
r34 = +.63 ± .097    r34.1 = +.384    r34.2 = +.398    r34.12 = +.2870

R1(3) = +.68;   R1(3,2) = +.6969;   R1(3,2,4) = +.7224

* Standard errors used throughout.
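The standard errors in column 2 look like the usual large-sample formula for a Pearson coefficient, $(1 - r^2)/\sqrt{n}$. That is an assumption: the text does not give the formula or say whether $n$ was taken as 38 or 37. With $n = 38$ the sketch gives about .087 for $r_{13}$ (printed .09) and .113 for $r_{12}$ (printed .114).

```python
from math import sqrt


def standard_error_r(r, n):
    """Large-sample standard error of a Pearson r: (1 - r^2) / sqrt(n)."""
    return (1 - r * r) / sqrt(n)


# Zero-order coefficients of Table III, n = 38 families (assumed).
for r in (0.55, 0.68, 0.61, 0.64, 0.62, 0.63):
    print(f"r = {r:+.2f}  SE = {standard_error_r(r, 38):.3f}")
```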


A second partial answer to the question is given in Table IV, in 
which are presented correlations between our ratings and two other 
rating scales devised by Chapman and Sims* in one case and Holley® 
in the other, applied to our 38 families. It will be noted that the 
average correlation between the Chapman-Sims rating and ours on the 
38 families is .61, and that the average correlation (Rho rank order) 
between the Holley rating and ours on 18 families for which the data 
are comparable, is .67. 
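For the Holley comparison the text uses a rank-order coefficient (rho). A minimal sketch of Spearman's formula, $\rho = 1 - 6\sum d^2 / [n(n^2 - 1)]$, follows. It assumes there are no tied scores; the text does not say how ties among the 18 families were handled. The five-family data are hypothetical.

```python
def ranks(values):
    """Rank positions 1..n, smallest value ranked 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank-order coefficient: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least two items")
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for five families on two scales (illustration only).
print(spearman_rho([10, 20, 30, 40, 50], [12, 8, 35, 31, 60]))  # approximately 0.8
```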

The construction of a valid scoring method to measure socio-economic
status depends on (1) the representative character of the
sample studied (see Table II) and (2) verification in independent
tests. These two criteria have been discussed. Let us now proceed
to two other problems, (1) analysis of our data on socio-economic status 
in terms of the group of 38 families as a self-contained group without 
special reference to devising a scoring method of universal validity, 
and (2) construction of a simply scored scale which shall have high 
predictive value for the position of these families on each of the four 
scales already devised. 

To analyze our 38 families as a self-contained group we have computed the SD of the culture scores for all families, the SD of the income scores for all families, etc., and then divided the deviation of each particular family score from the mean by the SD of that respective scale. This was done for all four scales, thus giving us a table of family position (status) in terms of sigma deviations. The average sigma deviation of each family on the four scales was then computed.
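The conversion to sigma deviations and the family averages can be sketched as below. The text does not say whether the SD was computed with divisor n or n - 1; the sketch assumes the population form (`statistics.pstdev`). The four-family data are hypothetical.

```python
from statistics import fmean, pstdev


def sigma_deviations(scores):
    """Convert raw scores on one scale to (score - mean) / SD."""
    m, sd = fmean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]


def average_sigma_deviation(scales):
    """scales: one list of scores per scale, each covering the same families in order.
    Returns each family's mean sigma deviation across the scales."""
    z_by_scale = [sigma_deviations(s) for s in scales]
    return [fmean(family) for family in zip(*z_by_scale)]


# Hypothetical scores for four families on two scales (illustration only).
culture = [10, 20, 30, 40]
income = [500, 700, 600, 1000]
print(average_sigma_deviation([culture, income]))
```

Putting each scale in sigma units first keeps a scale with large raw numbers, such as effective income, from dominating the average.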

Two methods of analysis of these tabular data of sigma deviations were then followed: (1) at the suggestion of Mr. Harold Hosea, profile diagrams were charted, and (2) correlation coefficients were computed.

The profile diagrams show little uniformity except a tendency 
toward somewhat more flattened profiles in families of lower occupa- 
tional and economic status, and more angular profiles in families of 
higher occupational and economic status. 


TABLE IV. INSTITUTE OF CHILD WELFARE SOCIO-ECONOMIC STATUS CORRELATED WITH CRITERIA

                          Institute of Child Welfare scales
Criterion                 Culture        Effective      Group setting   Household      Average
                                         income         index (both)    equipment

Chapman-Sims (38 cases)   +.721 ± .078   +.563 ± .110   +.606 ± .101    +.621 ± .101   +.61
Holley (18 cases)         +.753          +.768          +.65            +.512          +.67




















Analysis by correlation coefficients is shown in Table V (columns
6 to 11). From this table it is evident that the intercorrelations among
sigma deviation scores run higher than among gross scores, ranging in 
the former from +.593 to +.757 and averaging +.692 (columns 7 to 
10), and in the latter ranging from +.55 to +.68 and averaging +.62 
(columns 2 to 5). Perhaps the most interesting result of this analysis 
is that correlations of average sigma deviation scores with other scoring 
methods run high, namely: With Holley +.818, with Chapman-Sims 
+.741, and with our living room score +.794 (column 6). 

This brings us to our last problem, namely the construction of a simply scored scale which shall have a high predictive value for the positions of these 38 families on our four scales for culture, income, group activity, and household equipment. The closing statement of the preceding paragraph is pertinent here: the high correlation coefficients of the Holley, Chapman-Sims and living room scales with the composite or average sigma deviation score indicate that any one of these three scales¹ (column 6) has high predictive value in forecasting what a family’s status is likely to be on a composite score which includes cultural factors, effective income, participation in group activity and material equipment of the household, for the group of families studied.
SCALE OF WEIGHTS FOR RATING THE EQUIPMENT OF THE LIVING ROOM OF AN URBAN MIDDLE CLASS FAMILY

Instructions

The following list of items is for the guidance of the recorder only. Not all of the features listed will be found in any one home. Entries on the schedules, however, should follow the order and numbering indicated. Weights appear after the name of the respective items.

I. Fixed features.
  Softwood 1, hardwood 2, composition 3, stone 4.
  Composition 1, carpet 2, small rugs 3, large rug 4.
  Paper 1, kalsomine 2, plain paint 3, decorative paint 4, wooden panels 5.
  Painted 1, varnished 2, stained 3, oiled 4.
  Door protection: screen 1, storm door 1.
  Window protection: screen, blind, netting, storm sash, 1 each.
  Window covering: shades 1, curtains 2, drapes 3.
  Andirons, screen, poker, tongs, shovel, brush, hod, basket, rack, 1 each.
  Stove 1, hot air 2, steam 3, hot water 4.
  Artificial light: kerosene 1, gas 2, electric 3.
  Artificial ventilators 1.

II. Built-in features.
  15. Book containers: shelves 1, cases 2.
  In-a-sideboard 1, in-a-ceiling 2, in-a-door 3.
  Window seats 1.
  Window boxes 1.
  Clothes closets 1.
  Mirror 1.

III. Standard furniture.
  Sewing 1, writing 1, card 1, library 2.
  Straight, rocker, arm-chair, high-chair, 1 each.
  Stool or bench: high stool, foot-stool, piano stool, piano bench, 1 each.
  Cot 1, sanitary couch 2, chaise longue 3, day-bed 4, davenport 5, bed-davenport 6.
  Business 1, personal-social 2.
  Wardrobe or movable cabinet 1.
  Sewing cabinet 1.
  Sewing machine: hand power 1, foot power 2, electric 3.

  Furniture, table, chair, couch, piano, 1 each.
  Floor (large), floor-bridge, 1 each.
  Candle holders, 1 each.
  Mantel, grandfather, wall, alarm, 1 each.
  Factory made, hand made, waste, sewing, sandwich, decorative, 1 each.
  Statues 1.
  Switchboard connection 1, two-party line 2, one-party line 3.
  43. Photographs of personal interest, 1 each (portraits; note social or business mainly).
  Under type: original, reproduction. Under medium: (original) oil, water color, etching, wood block, lithograph, crayon drawing, pencil drawing, pen and ink, brush drawing, photograph (when treated as a work of art); (reproduction) photograph, half tone, color print, chromo; (a) adult, (b) child, 1 each.
  Poetry, fiction, history, drama, biography, philosophy, essays, religion, art, science (physical, psychological, social), atlas, dictionary, encyclopedia, .20 for each volume.
  General, labor, local community, sectarian, 1 for each paper.
  News (current events), professional, religious, fraternal, literary, popular, science, art, fashion, popular story, children’s, 1 for each.
  Crystal 1, one-tube 2, two-tube 3, three-tube 4, superheterodyne 5, etc.
  50. Musical instruments: piano 5, organ 1, violin 1, etc.
  51. Mechanical musical instruments: music-box 1, phonograph 2, player-organ 3, player-piano 4, etc.
  52. Sheet music: opera, folk, military, ballads, classic, jazz, dance (other than jazz), children’s, exercises, .05 for each sheet.
  53. Phonograph records: type of music (as above); type of instrument reproduced; voice: solo, duet, quartet, chorus; instrumental: solo instrument (piano, violin, etc.), trio, quartet, band, orchestra, .10 for each record.

  Total
  Grand Total
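Scoring with the scale is a weighted tally. As a sketch, a recorder's entries can be held as (item, count, weight) triples and summed. The weights below are copied from the scale; the household and its item counts are hypothetical.

```python
# Each entry: (description, count, weight per unit). Weights come from the scale;
# the household itself is hypothetical.
entries = [
    ("artificial light: electric", 1, 3),
    ("window covering: curtains", 1, 2),
    ("book volumes", 120, 0.20),
    ("musical instrument: piano", 1, 5),
    ("sheet music, per sheet", 40, 0.05),
    ("phonograph records, per record", 30, 0.10),
    ("papers taken, per paper", 2, 1),
]


def living_room_score(items):
    """Sum of count x weight over all recorded items."""
    return sum(count * weight for _, count, weight in items)


print(round(living_room_score(entries), 2))  # 3 + 2 + 24 + 5 + 2 + 3 + 2 = 41
```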
Since our living room score is easily obtained in a comparatively short interview which does not involve any objective inquisitorial questions, and since in addition it has high predictive value for the family’s socio-economic status in a very broad sense of the term, we offer this scale of living room scores as a convenient device to measure socio-economic status.
I. SOME COMMENTS ON PROFESSOR THURSTONE’S
METHOD OF DETERMINING THE SCALE VALUES
OF TEST ITEMS


KARL J. HOLZINGER 
University of Chicago 


The purpose of this article is to point out certain limitations in the scaling method proposed by Professor L. L. Thurstone in the October, 1925, issue of this Journal and further illustrated by him in the November, 1927, issue. A new probable error formula will also be derived. This formula will be used in part to test the validity of some of Professor Thurstone’s assumptions. A brief account of his method will first be given, using a notation similar to that employed by him in the 1925 article. The symbols used are listed below.


NOTATION

$M_1$ and $M_2$ = the means of two normal curves representing the distributions of ability for two adjacent age or grade groups.

$\sigma_1$ and $\sigma_2$ = standard deviations of groups 1 and 2.

${}_kX_1$ and ${}_kX_2$ = scale values of item $k$ for groups 1 and 2. These values are measured in $\sigma$ units from the means of the respective groups, and are determined in the usual way by noting the percentages of pupils who answer question $k$ correctly.

$n$ = number of test items to be scaled.

$m_1$ and $m_2$ = means of the $n$ values of $X_1$ and $X_2$ respectively, e.g.

$$m_1 = \frac{1}{n}\sum_{k=1}^{n} {}_kX_1$$

$S_1$ and $S_2$ = standard deviations of $X_1$ and $X_2$ for the $n$ items.

${}_kz_1 = {}_kX_1\sigma_1$ and ${}_kz_2 = {}_kX_2\sigma_2$ are the final scaled values of item $k$ measured from $M_1$ and $M_2$.

${}_kZ_1 = {}_kz_1 + M_1$ and ${}_kZ_2 = {}_kz_2 + M_2$ are the final scaled values, measured from the origins of groups 1 and 2.

$E_{12} = {}_kZ_1 - {}_kZ_2$ is a measure of the error in the final scaled values.

$S_E$ = standard deviation of $E_{12}$ for $n$ items.

$r$ = correlation between $X_1$ and $X_2$, or $Z_1$ and $Z_2$, for $n$ pairs of items.

$S_Z$ = standard deviation of $Z$ for $n$ items.
RÉSUMÉ OF PROFESSOR THURSTONE’S METHOD

Professor Thurstone’s method is to assume that

$$M_1 + {}_kz_1 = M_2 + {}_kz_2 \qquad (1)$$

From this equation it follows at once that

$$S_{z_1} = S_{z_2}$$

or,

$$S_1\sigma_1 = S_2\sigma_2 \qquad (2)$$

Summing over the $n$ values of equation (1) and dividing by $n$, we find that

$$M_1 + m_1\sigma_1 = M_2 + m_2\sigma_2 \qquad (3)$$

Equations (2) and (3) are equivalent to equations (6) and (7) in Professor Thurstone’s 1925 article and are the essence of his method.

The constants $S_1$, $S_2$, $m_1$ and $m_2$ are given by the data, while $\sigma_1$ and $M_1$ are determined from other groups or are given arbitrary values. Equations (2) and (3) may then be solved for the two remaining quantities, $\sigma_2$ and $M_2$. The final scaled values are then given by the formulas

$${}_kZ_1 = {}_kz_1 + M_1 = {}_kX_1\sigma_1 + M_1 \qquad (4a)$$

and

$${}_kZ_2 = {}_kz_2 + M_2 = {}_kX_2\sigma_2 + M_2 \qquad (4b)$$

In this 1925 article, Professor Thurstone recommends taking the arithmetical mean of these two scaled values, or

$${}_kZ = \tfrac{1}{2}\left\{({}_kX_1\sigma_1 + M_1) + ({}_kX_2\sigma_2 + M_2)\right\} \qquad (5)$$
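Solving (2) and (3) is direct: $\sigma_2 = S_1\sigma_1/S_2$ and $M_2 = M_1 + m_1\sigma_1 - m_2\sigma_2$. The sketch below applies that, then (4a), (4b) and (5), to the figures for the Binet question that the article quotes further on. The solved $\sigma_2$ and $M_2$ land within about .002 of the quoted 1.496 and 4.875, a gap presumably due to rounding in the published inputs.

```python
def solve_second_group(sigma1, M1, S1, S2, m1, m2):
    """Solve equations (2) and (3) for sigma2 and M2.
    (2): S1*sigma1 = S2*sigma2      (3): M1 + m1*sigma1 = M2 + m2*sigma2
    """
    sigma2 = S1 * sigma1 / S2
    M2 = M1 + m1 * sigma1 - m2 * sigma2
    return sigma2, M2


def scaled_values(X1, X2, sigma1, M1, sigma2, M2):
    """Formulas (4a) and (4b) for one item, and their mean (5)."""
    Z1 = X1 * sigma1 + M1
    Z2 = X2 * sigma2 + M2
    return Z1, Z2, (Z1 + Z2) / 2


sigma2, M2 = solve_second_group(sigma1=1.333, M1=4.061, S1=1.478, S2=1.318,
                                m1=-0.293, m2=-0.805)
Z1, Z2, Z = scaled_values(0.277, -0.473, 1.333, 4.061, sigma2, M2)
print(round(sigma2, 3), round(M2, 3))
print(round(Z1, 2), round(Z2, 2), round(Z, 2))
```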


In case the $n$ test items are to be scaled for a number of groups, Professor Thurstone merely extends the above method to other pairs of adjacent groups, giving $\sigma_1$ and $M_1$ arbitrary values. The constants $\sigma_2$ and $M_2$ may then be found from groups 1 and 2, while $\sigma_3$ and $M_3$ may be found from groups 2 and 3, and so on. The various ${}_kZ$ values thus found are then averaged by using certain weights.


DISCUSSION OF THE METHOD 


Returning to equation (1) we may now examine the limitations of Professor Thurstone’s method. This equation is, of course, a general one, applying to all pairs of adjacent groups. It therefore follows that

$$S_{Z_1} = S_{Z_2} = S_{Z_3} = \cdots$$

In other words, it is assumed that the standard deviations of the final scaled values ${}_kZ$ are equal for all groups. Let us first suppose that equation (1) is exactly true for the $n$ test items, and second that it is only approximately true.

If equation (1) is an exact relationship between the variables, then

$${}_kZ = {}_kz_2 + M_2 = \cdots = {}_kz_p + M_p = {}_kz_1 + M_1 \qquad (6)$$

where $p$ denotes the last group. This means that the scale values of item $k$ are exactly equal for all groups and therefore equal to ${}_kX_1\sigma_1 + M_1$, which is found from the first group. Nothing would be gained, therefore, by using more than one group for the scaling.

It should be pointed out that this limitation holds in the case 
where all of the items can be given to the pupils in a particular group, 
and where no item will be passed or failed by all pupils of that group. 
This we believe to be the usual situation in scaling tests, so that the 
above objection is important. 

As an example of this case we may take the Trabue data cited on p. 522 of Professor Thurstone’s 1927 article. The scaling equation for the items in Grade II becomes ${}_kZ_2 = 3.000 + {}_kX_2$, since Professor Thurstone has taken $M_2 = 3.000$ and $\sigma_2 = 1.000$. Professor Trabue’s scaling equation for these items may be written in the form ${}_kT_2 = 1.687 + .6745\,{}_kX_2$. It is therefore apparent that ${}_kZ_2$ and ${}_kT_2$ are both linear functions of ${}_kX_2$, and one is no better than the other.

Thus sentence 1 was passed by 64.3 per cent of the pupils in Grade
II. Looking up .143 in tables of the deviates of the normal curve,
we find ${}_1X_2 = -.366$. Professor Thurstone therefore assigns the
scale value 3.000 - .366 = 2.634, while Professor Trabue gives
1.687 — .6745 X .366 ='1.14. 
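The table look-up can be reproduced with the inverse normal distribution function in Python's standard library. An item passed by 64.3 per cent of a group lies .143 of the area below the mean, so its deviate is about -.366 (negative because the item is easy). The sketch then applies Thurstone's Grade II equation with $M_2 = 3.000$ and $\sigma_2 = 1.000$.

```python
from statistics import NormalDist


def deviate_from_pass_rate(p_pass):
    """Sigma deviate X of an item passed by the proportion p_pass of a group.
    Items passed by more than half the group get negative X."""
    return -NormalDist().inv_cdf(p_pass)


X = deviate_from_pass_rate(0.643)
M2, sigma2 = 3.000, 1.000  # Thurstone's values for Grade II, as given in the text
print(round(X, 3), round(M2 + sigma2 * X, 3))  # about -0.366 and 2.634
```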

Professor Thurstone then determines seven other scale values for
question 1, these being 2.537, 2.291, 2.292, 2.778, 3.122, 3.231 and
3.164 with an average weighted value for all eight of 2.665. We hold 
that the elaborate calculation of these seven values and their weighted 
average is entirely superfluous, since they were all assumed to be equal 
at the start. It would have been just as reliable and enormously sim- 
pler to use only one group consisting of a large number of individuals. 

Let us turn now to the case where equation (1) is only approxi- 
mately true and obtain a measure of the error involved in Professor 
Thurstone’s scaling method when it is applied. The difference between 
two scale values for adjacent groups is given by 


$$E_{12} = {}_kZ_1 - {}_kZ_2 = {}_kX_1\sigma_1 + M_1 - {}_kX_2\sigma_2 - M_2 \qquad (7)$$
The variance of $E_{12}$ is given by

$$S_E^2 = S_1^2\sigma_1^2 + S_2^2\sigma_2^2 - 2rS_1\sigma_1S_2\sigma_2$$

We shall now assume that

$$S_1\sigma_1 = S_2\sigma_2 = S_Z$$

This assumption is implied in Thurstone’s method even though equation (1) is only an approximation. Making this substitution we may then write

$$S_E^2 = 2S_Z^2(1 - r) \qquad (8)$$


As an example of the use of these last formulas we may take an illustration given by Professor Thurstone on page 448 of his 1925 article. For question 40 of the Binet series, the following values were obtained:

$X_1 = +.277$                     $X_2 = -.473$
$M_1 = 4.061$                     $M_2 = 4.875$
$\sigma_1 = 1.333$                $\sigma_2 = 1.496$
$S_1 = 1.478$                     $S_2 = 1.318$
$S_{z_1} = S_1\sigma_1 = 1.971$   $S_{z_2} = S_2\sigma_2 = 1.971$
$m_1 = -.293$                     $m_2 = -.805$
$r = .9887$

Using formulas (4a) and (4b) we find that $Z_1 = .277 \times 1.333 + 4.061 = 4.43$ and $Z_2 = -.473 \times 1.496 + 4.875 = 4.17$.

The average of these two scale values, or 4.30, is taken by Professor Thurstone as the final value. It will be noted that the difference $Z_1 - Z_2 = 0.26$.


This is one of the $E$ values given by formula (7). The value of $S_E^2$ may now be found from formula (8) by substituting $S_Z = 1.971$ and $r = .9887$, giving

$$S_E^2 = 3.885 \times .0226 = .088$$

The probable error of $S_E^2$ is given by the formula*

$$PE_{S_E^2} = \frac{1.349}{\sqrt{n}}\,S_Z^2(1 - r)\sqrt{3 - r^2} \qquad (9)$$

For $n = 50$, $r = .9887$ and $S_Z = 1.971$ we find $PE_{S_E^2} = .012$.

We may therefore write $S_E^2 = .088 \pm .012$ and conclude that the variance could not have arisen from the fluctuations of random sampling.

* The derivation of this formula will be given in the next section.

The correlation of .9887 used here is one of the highest of Professor Thurstone’s list of eleven, the average value being .98. Taking $S_Z = 2$ and $r = .98$ and substituting in formulas (8) and (9), we find $S_E^2 = .160 \pm .022$, which is a crude approximation to the average error for the whole series.
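Formulas (8) and (9) can be checked directly against both worked cases, $n = 50$ in each. The ratio of $S_E^2$ to its probable error, roughly 7 in both cases, is what supports the conclusion that the error variance is not a sampling accident.

```python
from math import sqrt


def error_variance(S_Z, r):
    """Formula (8): S_E^2 = 2 * S_Z^2 * (1 - r)."""
    return 2 * S_Z ** 2 * (1 - r)


def pe_error_variance(S_Z, r, n):
    """Formula (9): probable error of S_E^2."""
    return 1.349 * S_Z ** 2 * (1 - r) * sqrt(3 - r ** 2) / sqrt(n)


for S_Z, r in [(1.971, 0.9887), (2.0, 0.98)]:
    v, pe = error_variance(S_Z, r), pe_error_variance(S_Z, r, 50)
    print(f"S_Z={S_Z}, r={r}: S_E^2 = {v:.3f} +/- {pe:.3f}  (ratio {v / pe:.1f})")
```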

From these results it is apparent that Professor Thurstone is wrong in assuming that the linearity of the plot of $X_1$ and $X_2$ may be tested by inspection. He is also in error in his statement on p. 517 of the 1927 article that “the fluctuations in scale values in our data are due primarily to variable or chance errors.” It would further appear that his system of weighting the ${}_kZ$ values is unwarranted, given such errors in the values themselves.

In case the plot of $X_1$ and $X_2$ proves to be linear by the tests given
with formulas (8) and (9), we should then have an example of the first 
case of exact relationship, within the limits of random sampling. 
Here, again, there would be nothing gained by applying the ordinary 
scaling method to more than one group. 


THE PROBABLE ERROR OF $S_E^2$

In deriving the probable error of $S_E^2$ we shall use the symbol $d$ to denote a statistical differential and heavy square brackets [ ] to indicate the sum for all samples divided by the number of samples. Setting $S_E^2 = u$ and $S_Z = s$, so that $u = 2s^2(1 - r)$ by formula (8), we find to a first approximation that

$$\tfrac{1}{2}\,du = 2s\,ds - s^2\,dr - 2sr\,ds$$

Squaring both members of this equation, summing for all samples, and dividing by the number of samples, we may then write

$$\frac{[d^2u]}{4} = 4s^2(1 - r)^2[d^2s] - 4s^3(1 - r)[ds\,dr] + s^4[d^2r]$$

Substituting the well-known formulas $[d^2s] = s^2/2n$, $[ds\,dr] = sr(1 - r^2)/2n$, and $[d^2r] = (1 - r^2)^2/n$, we find upon simplification that

$$\frac{[d^2u]}{4} = \frac{s^4(1 - r)^2(3 - r^2)}{n}$$

or

$$\sigma_u = \frac{2s^2(1 - r)\sqrt{3 - r^2}}{\sqrt{n}} \qquad (10)$$

Multiplying (10) by .6745 gives the probable error in formula (9).
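The “simplification” step rests on the identity $2 - 2r(1 + r) + (1 + r)^2 = 3 - r^2$, after a common factor $s^4(1 - r)^2/n$ is taken out of the three substituted terms. A quick numeric spot check (not a proof) evaluates both sides for several values of $s$, $r$ and $n$.

```python
from math import isclose


def expanded(s, r, n):
    """[d^2u]/4 before simplification, using the three sampling formulas."""
    d2s = s ** 2 / (2 * n)
    dsdr = s * r * (1 - r ** 2) / (2 * n)
    d2r = (1 - r ** 2) ** 2 / n
    return 4 * s ** 2 * (1 - r) ** 2 * d2s - 4 * s ** 3 * (1 - r) * dsdr + s ** 4 * d2r


def simplified(s, r, n):
    """The closed form s^4 (1 - r)^2 (3 - r^2) / n."""
    return s ** 4 * (1 - r) ** 2 * (3 - r ** 2) / n


for s, r, n in [(1.971, 0.9887, 50), (2.0, 0.98, 50), (1.0, 0.5, 30), (3.2, 0.1, 200)]:
    assert isclose(expanded(s, r, n), simplified(s, r, n), rel_tol=1e-9)
print("expansion matches the simplified form")
```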
This formula is of course a first approximation, the higher-order differentials having been neglected. It is also possible that the distribution of $S_E^2$ is not normal for extremely high values of $r$. Such difficulties, however, are not at all peculiar to formula (10). Thus the probable error of the correlation coefficient, given by

$$PE_r = \frac{.6745(1 - r^2)}{\sqrt{n}} \qquad (11)$$

is only a rough approximation for most data, and for $r > .5$ and $n = 50$ the distribution of $r$ is far from normal.





SUMMARY 


1. It has been shown that in case Professor Thurstone’s assumption 
given by equation (1) is exactly true, then his scaling method becomes 
unnecessary because the scale values from all groups will be identical. 

2. A formula has been given for testing the linearity of the plot of
$X_1$ and $X_2$. The application of this formula to relationships which
Professor Thurstone assumed to be linear shows significant divergencies. 

3. Considered logically, Professor Thurstone’s method appears to 
be quite unnecessary in case all items may be given to the children in 
one group. If the data test linear, the method based on more than 
one group is legitimate but is superfluous; if the data do not test linear,
then the method doesn’t apply at all. 

4. The above conclusions refer to the scaling of items all of which 
may be given to one group. Professor Thurstone’s method may still 
be an ingenious device for obtaining approximate values for the 
unknown σ’s and M’s in case they should be required in scaling tests
such as Binet. 

5. It is recommended that only one age or grade group be used in 
scaling tests whenever this is possible. The best practice at present is 
to use some standard group, like 12-year-olds. This procedure has 
long been in use at Columbia and Professor Thurstone is setting up a 
straw man when he attacks Dr. Thorndike for obsolete methods 
that Dr. Thorndike himself would doubtless repudiate today. 


II. COMMENT BY PROFESSOR L. L. THURSTONE 
University of Chicago 


When I first saw Professor Holzinger’s memorandum about my scaling method I wondered what terrible blunder I might have committed. I have given so much thought to the problem of scale construction that a simple error did not seem likely. After reading the memorandum I was again at ease.

Professor Holzinger has not even mentioned the problem that I set out to solve, nor has he acknowledged that I did solve it. I shall explain his errors first with reference to the problem that I set out to solve and also with reference to the smaller problem to which he confines himself. In both cases he is seriously in error.

The problem for which I devised my scaling method was as follows: The so-called PE method of scale construction is well illustrated by the published data of Trabue and of Woody. This method assumes that the dispersion remains constant for all age and grade groups. My scaling method is an improvement on the PE method in that it allows the dispersion to vary from one age or grade group to another.

Professor Holzinger has a laborious proof of the obvious statement 
that ideally the scale value of a test item should be the same for all 
groups. That is exactly what we want. The scale value of an item 
should be the same no matter which age group is used for standardiza- 
tion. This is exactly the condition that the PE method does not 
fulfill. Professor Holzinger does not seem to notice the gross incon- 
sistencies in the scale values of each sentence in Trabue’s data. Does 
he really mean to defend Trabue’s procedure? 

As a specific example of his criticism Professor Holzinger takes 
sentence No. 1 in the Trabue scale, apparently defending the PE 
method of scale construction which I have tried to improve. Look at 
the accompanying figure in which I have shown graphically the scale 
values of sentence No.1. I have included in this graph both the scale 
values given by Trabue for the PE method of scaling and my own 
scale values for the same sentence. Note that for the PE method the 
scale values of sentence No. 1, as determined by the different age
groups, jump over a range of 5.5 PE, which is over one-half of the
whole range of mean performance and one-third of the whole range of
the scale! Does Professor Holzinger propose to defend a scaling method
with such huge errors? He avoids the question. That was the very 
problem that I set out to solve. The fluctuation in scale value for this 
same sentence in my scaling method is from 2.29 to 3.23, a range of 
0.94, which is about 7 per cent of the whole range. Will Professor 
Holzinger admit that my scaling method fits the data better than the 
PE method of scaling for the very sentence which he selects for his 
criticism, and will he make this admission similarly for the other 
sentences in the Trabue scale? 

In his summary the first point is that when my scaling equation is 
exactly true, then the whole scaling method is unnecessary because the 
scale values from all groups will be identical. This statement is quite 
correct. The absurdity of Professor Holzinger’s statement is seen in 
the fact that test data are never perfect. If test data were perfect, I 
would not have written that equation, nor would I have devised the 
scaling method. One of the problems of scale construction is to get 
the best scale value for each test item out of the data at hand with the 
realization that none of the data are perfect. This point of Professor 
Holzinger has no application to any scaling method. As a criticism 
of a scaling method it is ridiculous. 
His second point is to criticize the linearity of the plot of 7: and 2.
Look at the diagrams. They are Fig. 3 in my first article¹ and Fig. 4
in my second article² on this subject. You have three alternatives,
namely: (1) The plot is linear, (2) the plot is curvilinear, and (3) the 
points scatter over the diagram so that no function can be made out at 
all. Look at them again. If you don’t admit that they are linear, 
then you must claim that they are curvilinear or that the points scatter 
so that no function can be made out. My scaling method leaves the 
investigator quite free to stop scaling whenever any of the diagrams 
do not satisfy his own criterion of linearity. I have explicitly said 
this in my previous publications. In the cases that I have shown, the 
linearity is striking. After all, the proof of the pudding is the eating 
thereof. I have shown that my scaling method in which these partic- 
ular diagrams were treated as linear fits the actual data far better 
than the PE scaling method. Will Professor Holzinger admit that? 

When he applies his criterion to my calculations it should be noticed 
that he does not apply it also to the data of Trabue, or of Woody, for 
the PE method. Only by so doing can he make any comparative 
judgment about the two scaling methods at issue. Is he willing to 
acknowledge the results? 

Professor Holzinger makes much fuss about the obvious fact that 
the PE method of scaling and my scaling method are identical as long 
as one is confined to one age group. Nobody has ever denied that. 
Why not carry the comparison further to the whole scale for all the 
age groups? He avoids that. 

But Professor Holzinger prefers to standardize tests on a single age 
group. Let us turn now to this simplified proposal. Suppose that the 
data of Trabue were standardized on only one grade group and that 
the resulting scale were used for plotting the other grade distributions. 
This is Professor Holzinger’s simplified procedure which is as old as 
educational scale construction itself. Now please notice that the very 
same errors of measurement that worry him would be involved in his 
own procedure for scaling the Trabue data. The discrepancies in the 
experimental proportions for the several grade groups are not wiped 
out by merely calculating their probable errors. They can be minimized
by making use of all the data available and that is exactly what I
have done.

¹A Method of Scaling Psychological and Educational Tests. Journal of
Educational Psychology, October, 1925.

²The Unit of Measurement in Educational Scales. Journal of Educational
Psychology, November, 1927.

If Professor Holzinger determines scale values by the experimental 
proportions of right answers in one age group only, and if the other 
distributions are plotted on this scale, then he is violating an element- 
ary principle in the adjustment of observations. He is assuming that 
the proportions for the group chosen as a standard are free from obser- 
vational error and he throws all of the adjustment on the other groups. 
In the usual least-squares procedure this assumption can be made provided
that the independent variable is assumed free from error and that
all of the observational errors are assigned to the dependent variable. 
In the present problem there are observational errors in all the groups. 
Will Professor Holzinger acknowledge that in scaling the most common 
forms of test data that are presented as norms for several age groups, 
his simplified procedure does violence to an elementary statistical 
principle in the adjustment of observations? 

The third criticism states that my scaling method is quite unneces- 
sary in case all the items may be given to the children of one group. 
His recommendation is that scaling be done on only one age or grade 
group and he endorses McCall’s suggestion that 12-year-old children 
be used for scaling test data. There are a few practical difficulties 
that would be embarrassing. I wonder if Professor Holzinger is willing 
to acknowledge these difficulties as “limitations” of his simplified 
procedure. Here are a few of them. 

1. The actual available test data such as those of Trabue, Woody, 
Burt, and scores of others, are divided among many age groups. Thus, 
if the author of a test presents his norms for 200 children in each of five 
successive age groups, 1000 cases in all, then Professor Holzinger 
would be confined in his standardization to one fraction of the available 
data, namely to the records for one age group only. Perhaps Pro- 
fessor Holzinger will admit that in such a situation, and it is the typical 
one for nearly all published norms, the scale values will be more reliable 
if they can be determined from all the available records rather than from
the records for a single age group. That is precisely what I have done.
It is possible to confine oneself to only one of the age groups if all of the 
items can be scaled on one age group, but the resulting scale values 
will necessarily be less reliable than when the scale values are deter- 
mined from all of the data. 

2. Professor Holzinger would propose to use a large number of 12- 
year-old children instead of a smaller number of children for each of 
several ages. It just so happens that test data are not ordinarily so
collected and that is one of the reasons why I devised a scaling method 
to handle the data as we actually find them. But if you set out to get 
1000 12-year-old children for the purpose of standardizing a test, it 
will be necessary for you to enter classes in the grades and in the high 
school where these 12-year-old children are mixed with children of 
other ages. It is administratively rather awkward to get these chil- 
dren and to discard all other children in the same classes. At any 
rate, the generally published norms are not given that way. My scaling
method enables one to obtain scale values of high reliability by using 
the records for all age groups as actually published. 

3. Now suppose that the tests are to be standardized on 12-year-old 
children. Then you would present questions like these to 1000
12-year-old children: “The ____ is barking at the cat.” “We like
good boys ____ girls.” Add the following: 2 + 3 = ___. “Put
your finger on your nose.” “Show me your nose.” “Where is your
nose?” “Are you a little boy or a little girl?” Professor Holzinger is
welcome to the pleasure and the embarrassment of giving these test 
items, or others like them, to 1000 12-year-old children! The point 
here is that if we limit ourselves to one age group for standardization, 
the proportions of correct answers will be very high for easy test items 
and very low for the difficult items. These proportions of correct 
answers, near unity and zero, give scale values that are very unreliable. 
Will Professor Holzinger admit that this is a “limitation” of his simplified
proposal?
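The claim that proportions near unity and zero give unreliable scale values can be quantified. To a first approximation (the delta method), the standard error of the normal deviate obtained from a proportion p observed in N cases is √(p(1 − p)/N) divided by the ordinate of the unit normal curve at that deviate. Near p = 0 or 1 the ordinate shrinks much faster than the numerator, so the standard error grows.

The sketch below assumes normally distributed ability within the group. It uses 1000 cases only to echo the example in the text, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def deviate_standard_error(p, n_cases):
    """Approximate standard error of the normal deviate computed from an
    observed proportion p in n_cases subjects (first-order approximation):
    sqrt(p * (1 - p) / n) divided by the unit-normal ordinate at the deviate."""
    if not 0 < p < 1:
        raise ValueError("a proportion of exactly 0 or 1 gives no finite deviate")
    x = STD_NORMAL.inv_cdf(p)
    return sqrt(p * (1 - p) / n_cases) / STD_NORMAL.pdf(x)


if __name__ == "__main__":
    for p in (0.50, 0.80, 0.95, 0.99, 0.999):
        print(f"p = {p:.3f}: SE of deviate = {deviate_standard_error(p, 1000):.4f}")
```

The approximation itself becomes poor for proportions extremely close to 0 or 1. That is one more reason such items are hard to scale within a single group.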

4. Now suppose that Trabue’s sentences have been given to a 
large group of subjects of one age group only. It is not unlikely that 
some sentences will be found which are answered correctly by nearly 
all of the subjects, and that a few sentences will be found in which nearly 
all of the subjects fail. Then, I suppose, Professor Holzinger would 
throw out from this scale these very easy and very difficult items. 
Very well. That can be done. Does Professor Holzinger admit that
this constitutes a “limitation” to his simplified method of scaling? 
The very fact that proportions of correct answers for all of the items 
in the test for a single age group are found to be reliably distant from 
unity and from zero indicates unmistakably that the test series has a 
limited range of difficulty. 

5. Perhaps Professor Holzinger would prefer to have one scale of 
test items for several upper age levels and another simpler scale for 
several lower age levels in order to avoid the difficulty to which I have 
just called attention. These two scales could be standardized inde-
pendently. But it might be a legitimate inquiry to study the continu- 
ous growth of language ability through all the age levels. Then it 
would be necessary to express both scales in the same terms. This is 
just where the customary PE method breaks down because it assumes 
constant dispersion in all age groups. Professor Holzinger might find 
my scaling method useful for such a problem which he could not solve 
with the method he recommends. Does he admit this as a “limitation”
to his simplified procedure?

The final paragraph shows that Professor Holzinger does not know 
what the scaling problem is all about. He says that “Thurstone is
setting up a straw man when he attacks Professor Thorndike for
obsolete methods that Thorndike himself would doubtless repudiate
today.” I have not attacked Thorndike for anything. I have offered
an improvement on the scaling method which has been devised by 
Thorndike and his students. I have given page references to the 
literature wherever I have discussed the PE method of scaling. Has 
Thorndike ever repudiated the PE method of scaling, and has he 
offered any other method of scaling that does solve the problem for a 
wide range of difficulty and with variations in the dispersions of the 
successive age groups? There is no particular reason why he should or 
must do so. But if he has not done it, then I am not setting up a straw
man. Consider the current textbooks in educational measurement 
such as those of McCall and of Trabue or Garrett’s text in Statistics in 
Psychology and Education. If Professor Holzinger will take the trouble 
to read these current textbooks he will find that the PE method of 
scale construction is described in these books in good faith without any 
mention of its fallacious assumption to which I have called attention. 
Are these current textbooks straw men? It is my wish and expecta- 
tion that Professor Thorndike may approve of my efforts to clarify the 
logic of scale construction and, more particularly, of the present scaling 
method. But suppose that he disapproves of it entirely. Then, 
again, I am surely not setting up a straw man! As far as I know, no 
one has previously described the scaling method that I have proposed. 
Well, then, why the accusation about the straw man? 

Mr. Holzinger’s proposal is a reversion to the simplest form of 
solution to the scaling problem. It is so obvious that any one who 
thinks seriously about scaling problems must come on that trail very 
soon in his efforts, but it has “limitations” that I have enumerated 
and which constitute the motive for finding a more universal solution, 
free from these handicaps. The solution that I have offered is not
perfect but it is miles better than the PE method of scaling and it is 
free from the handicaps of scaling on a single group. 

When a test is constructed merely for administrative purposes in a 
school, no refined scaling of the items is necessary. If a spelling test 
is required for fifth grade children, it is sufficient that a good long list 
of words be available, selected from the general range of difficulty 
corresponding to the curriculum and arranged conveniently for pre- 
sentation and scoring. The children will be arranged in rank order 
sufficiently well for all practical purposes. It is for theoretical prob- 
lems in psychology and education that the analysis of scale construc- 
tion becomes valuable. 

If these criticisms had come from a student or from a novice I 
might have ignored them. But when vehement criticism on this level 
of comprehension comes from a man of Professor Holzinger’s sophisti- 
cation in statistical theory I have felt called upon to make my state- 
ment of his mistakes direct, explicit, and more than ordinarily frank. 
I have tried to explain his errors in conversation. He insists on carry- 
ing them into print. 


III. REPLY TO PROFESSOR THURSTONE
KARL J. HOLZINGER 


I may say at once that I still think I am correct in my criticism of 
Professor Thurstone’s Scaling Method, and that I am not seriously 
in error on any point. I shall attempt to answer his statements para- 
graph by paragraph and to save space will not repeat his statements. 

¶3. I showed in my article that Professor Thurstone’s method is
not an improvement for data such as in the Trabue and Woody tests. 

¶¶4 and 5. I consider my mathematics shorter and less laborious
than Professor Thurstone’s. No, I don’t defend Trabue’s procedure 
based on several groups. I, of course, noticed the variations in scale 
values shown in Professor Thurstone’s figure for sentence 1. These 
variations have nothing whatever to do with my comment, which
refers to a single grade. My point is that a single grade will furnish 
as reliable scale values as Thurstone’s elaborate procedure based on 
several grades. 

¶6. The only reason I discuss the case of the scaling equation as
exactly true, is that I wished to consider every possible case. The 
remainder of the comment seems to me irrelevant. 
¶¶7 and 8. I do not believe the linearity of a set of data can be
determined by urging the reader to “look” again and again any more
than the normality of a distribution can be determined by looking at
the data. The only reason I did not apply my formula to the Trabue
and Woody data was that it seemed to me quite unnecessary. I showed
that several of Professor Thurstone’s plots were not linear. If the test
is applied to Trabue data and the trend found to be linear, I should then
admit that his method is legitimate, but still entirely unnecessary
because a single group will give as good results.

¶9. See answer to paragraphs 4 and 5.

¶10. I admit error of measurement, but do not see that this paragraph
bears on my criticism. I do not claim any new simplified procedure
and I am not trying to “wipe out” anything by calculating
probable errors.

¶11. I do not propose plotting any distributions on the scale of one
selected group and I do not propose to use the theory of least squares, 
so I do not see what bearing this paragraph has on the discussion. 

¶12 (1, 2, 3, 4, 5). As stated in my article, I would not recommend
scaling a test on a smaller number of cases than by the Thurstone
method. I would try to secure several hundred children (say 12-year-
olds).

I admit that it is easier to get data from several age groups than from 
one, but I still believe it is usually possible to get a sufficient number of 
cases from one age group for scaling purposes. If this were not pos- 
sible, I should combine two or three adjacent age groups and scale from 
the total. There is no logical reason for preferring an age range of 
12-13 to one of 11-13 or 11-14. 

Professor Thurstone’s examples such as “The ______ is barking
at the cat” are amusing but have absolutely no bearing on my criticism.
I explicitly ruled out such cases in my criticism and admitted in my 
summary that this method might apply to such instances. 

The reference to Trabue’s data can be answered explicitly. On 
page 62 of Professor Trabue’s study we find that for Grade VIII the
easiest item was passed by 98.4 per cent while the hardest item was 
passed by 2.4 per cent of the children. Any errors in scaling at the 
extremes would probably not be greater than the errors in Pro- 
fessor Thurstone’s scaling method with several groups and the same 
number of cases. 

¶13. I assumed that McCall’s T-score methods were a repudiation
of the earlier methods at Columbia, but I may be wrong. 
¶14. My belief with regard to the scaling of individual items is
that ordinarily it is not worth while at all. I also believe that if one 
must scale the items he should use one group and not several as in 
Professor Thurstone’s method. Professor Thurstone’s method may be 
miles better than Professor Trabue’s method based on all groups, but 
I don’t think it is any better than the simple method based on one 
group in the situations described in my article. 

¶15. I agree that the refined scaling of items is often unnecessary.
Professor Thurstone’s system of weights, however, shows very clearly 
that he intends his method to be useful in obtaining final scaled values. 
I admit the theoretical value of his procedure, but deny its practical 
value. 

I am of the opinion that a critical attitude toward our present 
methods in tests and measurements is very necessary for the growth of 
educational science. A too ready acceptance of new methods without 
clear recognition of their limitations as in this case, will surely not lead 
to progress. 
SOME RELATIONS BETWEEN AMOUNT OF SCHOOL
TRAINING AND INTELLIGENCE AMONG NEGROES 


ROBERT A. DAVIS, JR. 


Baylor University, Waco, Texas 


PURPOSE AND METHOD


The results here reported came to light in an investigation whose 
general purport was to study the relationship existing between general 
intelligence and the number of years students have had in school. 
Comparisons were especially desired among different social groups 
including students from Negro schools. 

The school for which data are reported is a Negro Normal and 
Industrial School, a boarding and day school in the South under the 
management of the American Missionary Association of the Congre- 
gational Church. All the teachers are white with the exception of a 
few, and for the most part well trained. The high-school work,
covering four years, is accredited by the State department of
education, and accordingly each student, upon graduation, receives a
State teacher’s certificate without examination.

The Terman Group Intelligence Examination, Form A, was given 
to 222 of these students, Grades VIII to XII inclusive, the 15th of 
May, 1926; and the data relating to the ages and number of months in 
school (including the current year) were obtained by the principal 
from the school records. The mental ages from which the intelligence 
quotients were derived are in terms of the Stanford Revision of the 
Binet Scale. 

It is very probable that this test is not well adapted to Negro 
students. In fact, there is great doubt if any intelligence test as yet 
devised is well adapted to the Southern Negro. The results are 
presented as a part of this general study because of their bearing on
school training.


RESULTS 


The presentation of the results in this system begins with Fig. 1, 
which gives the distribution of Terman Intelligence Quotients of the 
222 students, Grades VIII to XII inclusive, showing a median of 78. 
This is followed by Fig. 2, which shows the distribution of these same 
students according to the number of months they had previously 
attended school, and the standard number of months for completing 
the various grades. It will be observed from these figures that the
median IQ is 78 when it would normally center around 100 for white
children in these same grades; and the median number of months in
school is 70, when it would usually be 90 months for these same grades
in all school systems where nine months’ school terms are maintained.

Fig. 1.—Distribution of Terman’s IQ’s of 222 Negro normal school students, high-school grades 8-12 inclusive.

Fig. 2.—Distribution of normal school students according to number of months attended school, including the current year (according to school records).

Fig. 3.—Mental abilities of normal school students according to number of months in school.

The relation of school training to intelligence scores obtained by 
these students is shown in Table I, which gives the distribution of the 
Terman IQ’s according to the number of months in school previous to, 
and including the current year. This is followed by Fig. 3, which 
shows graphically the three quartile points in these distributions. 
The length of each bar, therefore, indicates the range of scores necessary 
to include the middle half of all pupils belonging to each month group. 
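The three quartile points plotted in Fig. 3 are found separately within each month-of-schooling group. The sketch below shows that grouping step. The 12-month class interval, the function name, and the sample records are hypothetical, because the individual scores are not given in the text. Python's `statistics.quantiles` also uses one of several common interpolation rules for quartiles, so its results may differ slightly from a hand-graphed figure.

```python
from collections import defaultdict
from statistics import quantiles


def quartiles_by_month_group(records, interval=12):
    """Group (months_in_school, iq) records into class intervals of
    `interval` months; return Q1, median, Q3 and the interquartile range
    (the length of one bar in a chart like Fig. 3) for each group."""
    groups = defaultdict(list)
    for months, iq in records:
        groups[(months // interval) * interval].append(iq)
    summary = {}
    for lower in sorted(groups):
        scores = groups[lower]
        if len(scores) < 2:  # too few cases to locate quartiles
            continue
        q1, median, q3 = quantiles(scores, n=4)
        summary[lower] = (q1, median, q3, q3 - q1)
    return summary


if __name__ == "__main__":
    interval = 12
    # Hypothetical (months in school, IQ) records, not Davis's data.
    sample = [(62, 70), (65, 72), (60, 75), (70, 78), (66, 80), (71, 84), (63, 90),
              (74, 74), (76, 82), (80, 88), (82, 95)]
    for lower, (q1, med, q3, iqr) in quartiles_by_month_group(sample, interval).items():
        print(f"{lower} to {lower + interval - 1} months: "
              f"Q1 {q1:.1f}, Mdn {med:.1f}, Q3 {q3:.1f}, middle half spans {iqr:.1f}")
```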

In Table II is found a condensed summary of the median ages and 
months of schooling possessed by these students, together with the 
standard ages and months for all the grades considered. No attempt 
was made to distribute the intelligence quotients with reference to 
grade location because of the relatively small numbers of students 
represented in the upper grades. Special attention is called in this 
table to the regularity with which this group contains older pupils, 
who have had relatively less opportunity to secure an education. 
Attention is also called to the fact that neither the school training, as 
measured by number of months in school, nor the intelligence quotients, 
in any way approximate the standards derived from children who have 
been allowed a greater number of years in school, with longer school 
terms than have the students listed here. 


CONCLUSIONS 


The results presented in this investigation do not justify very 
definite conclusions. However, certain observations appear to be 
evident. The study calls attention to the need for different kinds of 
intelligence examinations, or at least the establishment of different 
norms in appraising the mental capacity of Southern Negro students. 
When and only when we have equalized the character as well as the 
amount of education possessed by the colored and white races can we 
draw distinct lines between them with reference to intelligence. When 
intelligence scores are distributed according to amount of school 
training, the influence of increased educational opportunity is easily 
shown. When amount of school training alone is considered, the 
educational training is very meager when compared with the standard 
for the whites. Attention has been called in Table II to the regularity 
with which this group contains older pupils who have had relatively 
less opportunity to secure an education. If we accept the view that
intelligence tests measure both native endowment and school training,
the influence of the lack of schooling is shown in this rather select
group of students studied.


TABLE I.—DISTRIBUTION OF TERMAN’S IQ’S OF NEGRO NORMAL SCHOOL STUDENTS
ACCORDING TO NUMBER OF MONTHS IN SCHOOL (INCLUDING CURRENT YEAR)

Grade   Number     Median   Standard   Median months   Standard
        of cases   age      age¹       in school       months²
  8        89      16.75     14.0          53.5          72.0
  9        63      17.40     15.0          63.0          81.0
 10        40      17.28     16.0          80.0          90.0
 11        21      17.75     17.0          85.5          99.0
 12         9      19.18     18.0         100.0         108.0

¹Standard for grade.
²Standard for completing grade.
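The regularity with which this group contains older, less-schooled pupils can be read off the table by simple subtraction. The sketch below uses only the figures printed in the table. For each grade it computes how far the median age exceeds the standard age and how many months of schooling the median student lacks relative to the standard. It also confirms that the grades total 222 cases and that the standard is nine months per completed grade. The dictionary layout and function name are arbitrary choices for illustration.

```python
# Rows of the table: grade -> (cases, median age, standard age,
# median months in school, standard months for completing the grade).
TABLE = {
    8: (89, 16.75, 14.0, 53.5, 72.0),
    9: (63, 17.40, 15.0, 63.0, 81.0),
    10: (40, 17.28, 16.0, 80.0, 90.0),
    11: (21, 17.75, 17.0, 85.5, 99.0),
    12: (9, 19.18, 18.0, 100.0, 108.0),
}


def lags(table):
    """Years over-age and months of schooling short of standard, per grade."""
    return {
        grade: (round(med_age - std_age, 2), round(std_months - med_months, 1))
        for grade, (_, med_age, std_age, med_months, std_months) in table.items()
    }


if __name__ == "__main__":
    print("total cases:", sum(row[0] for row in TABLE.values()))
    for grade, (over_age, short) in lags(TABLE).items():
        print(f"Grade {grade}: {over_age:.2f} years over-age, {short:.1f} months short")
```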



A FORMULA FOR FINDING THE AVERAGE INTER- 
CORRELATION COEFFICIENT OF UNRANKED RAW 
SCORES WITHOUT SOLVING ANY OF THE 
INDIVIDUAL INTERCORRELATIONS 


HAROLD A. EDGERTON 
AND 
HERBERT A. TOOPS
Ohio State University


In making rating scales and in constructing criterion scores, we are
often confronted with this question: “Do I have enough judges so that
the composite scores for each thing judged are statistically reliable;
i.e., that the composite scores will come out approximately the same
upon an independent repetition of the experiment?” To answer this
question, one needs to know the average intercorrelation among the
judges and then resort to Brown’s formula to estimate the number of
judges necessary to attain the minimal reliability desired. Often only
an approximation to the average intercorrelation of judges is used: the
correlation between the ratings of any two of the judges, “chosen at
random,” or else an average of two or three random, or rather
“available,” intercorrelations from the total number of intercorrelations
which might be solved. By the method described below, and without
being compelled to employ makeshift approximations, the true
average intercorrelation coefficient of the judges may be found,
provided, however, that the gross scores in the several n variables are
available.
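Brown's formula, usually called the Spearman-Brown prophecy formula, gives the reliability of the composite of n judges from their average intercorrelation r: R = nr / [1 + (n − 1)r]. Solving it for n gives the number of judges needed to reach a desired reliability. The sketch below shows both directions. The function names and the example target of .90 are hypothetical choices, not values taken from this paper.

```python
import math


def composite_reliability(r_bar, n_judges):
    """Brown's (Spearman-Brown) formula: reliability of the sum of the
    ratings of n judges whose average intercorrelation is r_bar."""
    return n_judges * r_bar / (1 + (n_judges - 1) * r_bar)


def judges_needed(r_bar, target):
    """Smallest whole number of judges whose composite reaches `target`,
    from solving the formula above for n."""
    if not (0 < r_bar < 1 and 0 < target < 1):
        raise ValueError("r_bar and target must lie strictly between 0 and 1")
    n = target * (1 - r_bar) / (r_bar * (1 - target))
    return max(1, math.ceil(n - 1e-9))  # small tolerance guards against float noise


if __name__ == "__main__":
    for r in (0.3, 0.5, 0.7):
        print(f"average r = {r}: {judges_needed(r, 0.90)} judges for a reliability of .90")
```

The answer depends entirely on the average intercorrelation, which is why an exact value of that average, rather than a makeshift approximation, matters.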

Kelley¹ offers a method of finding the average intercorrelation of n
variables by ranking the scores in the several variables. Its use,
however, is laborious, inasmuch as ranking the scores of each variable
is at best a tedious process, especially when N, the population, becomes
at all large. It also involves all the assumptions as to rectilinearity
and rectangular distributions involved in the Spearman rho.

The rank method has another disadvantage. When one solves by
Kelley’s formula for the average intercorrelation coefficient of measures
which are ranked, the conventional use of averaged ranks for tied
scores probably makes the obtained average intercorrelation coefficient
higher (if any such scores are so averaged) than the true average
intercorrelation coefficient. A formula that uses the actual scores,
without ranking, overcomes this difficulty by avoiding it. Computational
errors in this new gross-score formula are compensating rather than
cumulative. The formula derived below has the additional merit of
saving much time over the ranking method.

¹Kelley, T. L.: “Statistical Method.” Macmillan, 1924, p. 218, formula 172.

132 The Journal of Educational Psychology


DERIVATION OF THE FORMULA FOR SOLVING THE AVERAGE INTERCOR- 
RELATION COEFFICIENT WITHOUT COMPUTING ANY OF THE 
INDIVIDUAL CORRELATION COEFFICIENTS 


Let

$$r_{12} = \frac{\sum z_1 z_2}{N}, \quad \text{in which } z_1 = \frac{x_1}{\sigma_1} \text{ and } z_2 = \frac{x_2}{\sigma_2} \tag{1}$$


Then the average intercorrelation coefficient, r, of n variables is the
sum of all the possible correlations (numerator of equation (2) below)
divided by their number (denominator of equation (2) below), or,

$$r = \frac{r_{12} + r_{13} + r_{14} + \cdots + r_{(n-1)n}}{\dfrac{n(n-1)}{2}} \tag{2}$$
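Substituting equations analogous to equation (1) in equation (2),

$$r = \frac{\dfrac{\sum z_1 z_2}{N} + \dfrac{\sum z_1 z_3}{N} + \dfrac{\sum z_1 z_4}{N} + \cdots + \dfrac{\sum z_{n-1} z_n}{N}}{\dfrac{n(n-1)}{2}} \tag{3}$$

Or, factoring out $\frac{1}{N}$,

$$r = \frac{1}{N\dfrac{n(n-1)}{2}} \sum \left(z_1 z_2 + z_1 z_3 + z_1 z_4 + \cdots + z_{n-1} z_n\right) \tag{4}$$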
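Now,

$$\sum \left(z_1 z_2 + z_1 z_3 + z_1 z_4 + \cdots + z_{n-1} z_n\right) = \frac{\sum \left(z_1 + z_2 + z_3 + \cdots + z_n\right)^2}{2} - \frac{\sum \left(z_1^2 + z_2^2 + z_3^2 + \cdots + z_n^2\right)}{2}$$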


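which, when substituted in (4) gives,

$$r = \frac{1}{Nn(n-1)}\left[\sum \left(z_1 + z_2 + z_3 + \cdots + z_n\right)^2 - \sum \left(z_1^2 + z_2^2 + z_3^2 + \cdots + z_n^2\right)\right] \tag{5}$$

Since $\sum z^2 = N$, it follows that $\sum \left(z_1^2 + z_2^2 + z_3^2 + \cdots + z_n^2\right) = nN$.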
Also,

$$\frac{nN}{Nn(n-1)} = \frac{1}{n-1} \tag{6}$$

and,

$$r = \frac{1}{Nn(n-1)} \sum \left(z_1 + z_2 + z_3 + \cdots + z_n\right)^2 - \frac{1}{n-1} \tag{7}$$
which is the formula required if couched in terms of standard scores.
Equation (7) can now be put in a form so we can deal with the gross
scores rather than standard scores, or deviations; and finally we may so
arrange it that step scores may be used rather than gross scores. (In
case of step scores, read X′ in place of X, σ′ in place of σ, M′ in place
of M, in all formulas below.)

Substituting $z = \frac{x}{\sigma} = \frac{X - M}{\sigma}$ in equation (7),

$$r = \frac{1}{Nn(n-1)} \sum \left[\left(\frac{X_1}{\sigma_1} + \frac{X_2}{\sigma_2} + \cdots + \frac{X_n}{\sigma_n}\right) - \left(\frac{M_1}{\sigma_1} + \frac{M_2}{\sigma_2} + \cdots + \frac{M_n}{\sigma_n}\right)\right]^2 - \frac{1}{n-1} \tag{8}$$
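Now, let*

$$\frac{X_1}{\sigma_1} + \frac{X_2}{\sigma_2} + \cdots + \frac{X_n}{\sigma_n} = S\left(\frac{X}{\sigma}\right) \tag{9}$$

and,

$$\frac{M_1}{\sigma_1} + \frac{M_2}{\sigma_2} + \cdots + \frac{M_n}{\sigma_n} = S\left(\frac{M}{\sigma}\right) \tag{10}$$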


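Substituting in equation (8) and expanding,

$$r = \frac{1}{Nn(n-1)}\left[\sum \left\{S\left(\frac{X}{\sigma}\right)\right\}^2 - 2\,S\left(\frac{M}{\sigma}\right) \sum S\left(\frac{X}{\sigma}\right) + N\left\{S\left(\frac{M}{\sigma}\right)\right\}^2\right] - \frac{1}{n-1} \tag{11}$$

but,

$$\sum S\left(\frac{X}{\sigma}\right) = \frac{\sum X_1}{\sigma_1} + \frac{\sum X_2}{\sigma_2} + \cdots + \frac{\sum X_n}{\sigma_n} = N\left(\frac{M_1}{\sigma_1} + \frac{M_2}{\sigma_2} + \cdots + \frac{M_n}{\sigma_n}\right) = N\,S\left(\frac{M}{\sigma}\right) \tag{12}$$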


and hence,

$$r = \frac{1}{Nn(n-1)}\left[\sum \left\{S\left(\frac{X}{\sigma}\right)\right\}^2 - N\left\{S\left(\frac{M}{\sigma}\right)\right\}^2\right] - \frac{1}{n-1} \tag{13}$$

which is the formula desired, couched in terms of gross standard
scores.

*S means a horizontal summation in Table II for each or any individual.
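Equation (13) is an algebraic identity, so it should agree, up to rounding, with the average of the n(n − 1)/2 product-moment correlations computed one at a time. The sketch below checks this on randomly generated ratings. The seed, the numbers of judges and persons, and the step-score origin and interval are arbitrary choices, not data from this paper. It also shows that using step scores in place of gross scores, as allowed above, leaves the result unchanged. σ is computed with divisor N, as in the formula for σ used in the worked example, which is what makes $\sum z^2 = N$ hold.

```python
import random
from math import sqrt


def pearson(a, b):
    """Product-moment correlation of two equal-length score lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def average_r_pairwise(columns):
    """Average of all n(n - 1)/2 intercorrelations, each solved separately."""
    n = len(columns)
    rs = [pearson(columns[j], columns[k]) for j in range(n) for k in range(j + 1, n)]
    return sum(rs) / len(rs)


def average_r_formula(columns):
    """Equation (13): no individual correlation is computed."""
    n, N = len(columns), len(columns[0])
    sigmas = [sqrt(N * sum(x * x for x in c) - sum(c) ** 2) / N for c in columns]
    s_m = sum(sum(c) / N / s for c, s in zip(columns, sigmas))               # S(M/sigma)
    s_x = [sum(c[i] / s for c, s in zip(columns, sigmas)) for i in range(N)]  # S(X/sigma) per person
    return (sum(v * v for v in s_x) - N * s_m ** 2) / (N * n * (n - 1)) - 1 / (n - 1)


if __name__ == "__main__":
    rng = random.Random(1928)  # fixed seed; synthetic ratings, not real data
    trait = [rng.gauss(0, 1) for _ in range(40)]
    judges = [[round(5 + 2 * t + rng.gauss(0, 1.5)) for t in trait] for _ in range(5)]
    print(average_r_pairwise(judges), average_r_formula(judges))
    step_scores = [[(x - 5) / 2 for x in c] for c in judges]  # origin 5, interval 2
    print(average_r_formula(step_scores))
```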
SOLVING THE FORMULA WITH THE AID OF A CALCULATING MACHINE 


Given the problem: To find the average intercorrelation, r, of the
ratings of n = 6 judges, $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, and $X_6$, judging N = 10
persons, A, B, C, ..., J, rated on a certain trait. The steps are
as follows:

1. As in Table I, find $\sum X$, $\sum X^2$, and $M_x$ for each variable, using
the formula $M_x = \frac{\sum X}{N}$. Check these by having a second person obtain
all squares and sums independently.
2. Solve for σ of each variable using the formula:

$$\sigma = \frac{\sqrt{N\sum X^2 - \left(\sum X\right)^2}}{N}$$

and record in Table I, as indicated.
3. Find the reciprocals of $\sigma_1$, $\sigma_2$, $\sigma_3$, etc. Check by multiplying
$\sigma \times \frac{1}{\sigma}$, which should yield either 1.0000 . . . or .9999 . . .

4. Find $\frac{1}{\sigma}$ for each variable. Record in Table I as indicated.


5. Set $\frac{1}{\sigma_1}$ in the keyboard of the calculator and multiply it
successively by the successive $X_1$ values of the different persons judged
by Judge 1. This gives the needed values of $\frac{X_1}{\sigma_1}$, which are to be
recorded in Table II, column 1. Do the same for $\frac{1}{\sigma_2}X_2$, $\frac{1}{\sigma_3}X_3$, ...,
$\frac{1}{\sigma_6}X_6$, recording the quotients in columns 2 to 6 respectively.


6. Add the $\frac{X}{\sigma}$ scores by columns, recording the sum at the foot of
each column (Table II); thus $\sum\left(\frac{X_1}{\sigma_1}\right) = 23.5340$, $\sum\left(\frac{X_2}{\sigma_2}\right) = 22.9872$,
etc.
7. To check the sums found in step 6:

(a) First prove the essential correctness of the entries in the six
columns successively, using $\sum\left(\frac{X}{\sigma}\right) = \frac{\sum X}{\sigma}$. Thus in column 1,
$\sum\left(\frac{X_1}{\sigma_1}\right) = 23.5340$. This should be equal to $\frac{\sum X_1}{\sigma_1} = \frac{48}{2.0396} = 23.5340$.
In column 2, $\sum\left(\frac{X_2}{\sigma_2}\right) = 22.9872$, while $\frac{\sum X_2}{\sigma_2} = \frac{48}{2.0881} = 22.9874$. The
slight difference of .0002 is due to dropping decimals and is therefore
negligible.

(b) Find the sum of the six column sums: 23.5340 + 22.9872 + . . .
+ 29.9999 = 162.2501. This sum (162.2501) must check
with the result of step 9 below.


8. Add the $\frac{X}{\sigma}$ scores of Table II by rows, securing an $S\left(\frac{X}{\sigma}\right)$ for each
individual, and record the sums in the column headed $S\left(\frac{X}{\sigma}\right)$ (Table II,
column 7).
9. To check the ten sums just found and entered in column 7, add the
ten sums:

21.3182 + 17.1158 + . . . + 14.1596 = 162.2501,

which checks exactly (as it always should) with the result of step 7(b)
above.

TABLE I.—FINDING ΣX AND ΣX² AND Mx, σ, 1/σ, M/σ FOR EACH JUDGE

Judgments and squares of judgments made by judges 1, 2, 3, 4, 5, and 6

Persons
judged      X1   X1²    X2   X2²    X3   X3²    X4   X4²    X5   X5²    X6   X6²
A            8    64     5    25     7    49     6    36     3     9     6    36
B            4    16     2     4     7    49     5    25     5    25     4    16
C            7    49     6    36     5    25     3     9     5    25     5    25
D            4    16     5    25     4    16     4    16     2     4     3     9
E            2     4     1     1     1     1     3     9     2     4     2     4
F            6    36     5    25     8    64     6    36     4    16     5    25
G            6    36     7    49     5    25     3     9     3     9     5    25
H            5    25     8    64     7    49     5    25     6    36     4    16
I            1     1     3     9     2     4     5    25     3     9     2     4
J            5    25     6    36     2     4     3     9     4    16     3     9
Sums        48   272    48   274    48   286    43   199    37   153    39   169
(N = 10)
Mx               4.8         4.8         4.8         4.3         3.7         3.9
σ             2.0396      2.0881      2.3580      1.1874      1.2689      1.3000
1/σ         .4902922    .4789043    .4240882    .8421761    .7880841    .7692307
M/σ           2.3534      2.2987      2.0356      3.6214      2.9159      3.0000

S(M/σ) = 16.2250

TABLE II.—TABLE TO ASSIST IN FINDING Σ[S(X/σ)]²

Persons          1         2         3         4         5         6         7           8
judged       X1/σ1     X2/σ2     X3/σ3     X4/σ4     X5/σ5     X6/σ6    S(X/σ)   [S(X/σ)]²
A           3.9223    2.3945    2.9686    5.0531    2.3643    4.6154   21.3182    454.4657
B           1.9612    0.9578    2.9686    4.2109    3.9404    3.0769   17.1158    292.9506
C           3.4320    2.8734    2.1204    2.5265    3.9404    3.8461   18.7388    351.1426
D           1.9612    2.3945    1.6964    3.3687    1.5762    2.3077   13.3047    177.0150
E           0.9806    0.4789    0.4241    2.5265    1.5762    1.5385    7.5248     56.6226
F           2.9417    2.3945    3.3927    5.0531    3.1523    3.8461   20.7804    431.8250
G           2.9417    3.3523    2.1204    2.5265    2.3643    3.8461   17.1513    294.1671
H           2.4515    3.8312    2.9686    4.2109    4.7285    3.0769   21.2676    452.3108
I           0.4903    1.4367    0.8482    4.2109    2.3643    1.5385   10.8889    118.5681
J           2.4515    2.8734    0.8482    2.5265    3.1523    2.3077   14.1596    200.4943
Σ(X/σ)     23.5340   22.9872   20.3562   36.2136   29.1592   29.9999  162.2501   2829.5618
ΣX/σ       23.5340   22.9874   20.3562   36.2136   29.1591   30.0000   (for a check)
10. Find M₁/σ₁, M₂/σ₂, etc., by multiplying (1/σ₁) × M₁, (1/σ₂) × M₂,
etc., and record the results in Table I.
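11. To check (10), each Σ(X/σ)/N should equal its corresponding M/σ;
thus

$$\frac{\Sigma(X_1/\sigma_1)}{10} = \frac{M_1}{\sigma_1}, \qquad \frac{\Sigma(X_2/\sigma_2)}{10} = \frac{M_2}{\sigma_2}, \qquad \ldots$$

12. Find Σ(M/σ), the sum of the M/σ values of all the judges.
In the accompanying problem, it is 16.2250 (Table I).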
13. To check step (12): Σ(M/σ) should equal the sum, taken over all
the judges, of the expression

$$\sqrt{\frac{(\Sigma X)^{2}}{N\Sigma X^{2} - (\Sigma X)^{2}}}$$

(in case all ΣX-quantities are positive), or, what is the same thing,

$$\frac{\Sigma X}{N\sigma} = \frac{\Sigma X}{\sqrt{N\Sigma X^{2} - (\Sigma X)^{2}}}$$

These denominator expressions, NΣX² − (ΣX)², were all found in
computing the various σ's.
14. Square each Σ(X/σ) found in column 7 of Table II, and record the
square in column 8, which is headed [Σ(X/σ)]². Watch decimals carefully!
To check this operation, have another person independently compute
the squares.
15. Find Σ[Σ(X/σ)]². This is the sum of the entries of column 8
of Table II, and in this case is 2829.5618.
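16. All that remains now is to substitute the numerical values
already obtained in equation (13), where N = 10 is the number of persons
judged and n = 6 the number of judges:

$$\bar{r} = \frac{1}{Nn(n-1)}\left\{\Sigma\left[\Sigma\left(\frac{X}{\sigma}\right)\right]^{2} - N\left[\Sigma\left(\frac{M}{\sigma}\right)\right]^{2}\right\} - \frac{1}{n-1} \qquad (13)$$

$$\bar{r} = \frac{1}{(10)(6)(5)}\left\{2829.5618 - (10)(16.2250)^{2}\right\} - \frac{1}{5} = .4569$$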


The average intercorrelation secured by actually working out and 
averaging all the 15 individual intercorrelations of the above example, 
done in this instance for a verification of the formula, is also .4569. 
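The agreement is not a coincidence of this example. The short derivation below is offered as one way to see that formula (13) is algebraically the mean of all the pairwise correlations; it uses notation of our own (z for standard scores, s_i for person i's entry in column 7 of Table II, S for Σ(M/σ)) and is not necessarily the derivation given earlier in the paper.

$$
\begin{aligned}
z_{ij} &= \frac{X_{ij} - M_j}{\sigma_j}, \qquad \sum_i z_{ij}^{2} = N, \qquad r_{jk} = \frac{1}{N}\sum_i z_{ij} z_{ik},\\
s_i &= \sum_j \frac{X_{ij}}{\sigma_j}, \qquad S = \sum_j \frac{M_j}{\sigma_j}, \qquad \sum_j z_{ij} = s_i - S, \qquad \sum_i s_i = NS,\\
\sum_i \Big(\sum_j z_{ij}\Big)^{2} &= nN + 2N\sum_{j<k} r_{jk} = \sum_i (s_i - S)^{2} = \sum_i s_i^{2} - NS^{2},\\
\bar{r} &= \frac{2}{n(n-1)}\sum_{j<k} r_{jk} = \frac{1}{Nn(n-1)}\Big(\sum_i s_i^{2} - NS^{2}\Big) - \frac{1}{n-1}.
\end{aligned}
$$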


The time required by the usual methods for solving all the possible
n(n − 1)/2 intercorrelations increases roughly as the square of the
number of variables. By using formula (13) for finding the average
intercorrelation coefficient, the work increases only in direct
proportion to the number of variables. Thus the larger the number of
variables involved, the greater is the relative saving of labor which
may be effected by using the new formula.
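As an illustration of that saving, the procedure of steps 2 to 16 can be sketched in Python as below, alongside the "usual method" of averaging all 15 pairwise correlations. The judgments are typed from Table I; the function names (`sigma`, `average_r_formula`, `pearson`, `average_r_pairs`) are hypothetical names of our own, and this is a sketch, not the author's worksheet. Because it carries full floating-point precision rather than four decimals, its intermediate sums differ slightly from Tables I and II, but the two routes agree with each other to rounding error.

```python
import math
from itertools import combinations

# Judgments from Table I: one list per judge, persons A to J in order.
JUDGMENTS = [
    [8, 4, 7, 4, 2, 6, 6, 5, 1, 5],  # judge 1
    [5, 2, 6, 5, 1, 5, 7, 8, 3, 6],  # judge 2
    [7, 7, 5, 4, 1, 8, 5, 7, 2, 2],  # judge 3
    [6, 5, 3, 4, 3, 6, 3, 5, 5, 3],  # judge 4
    [3, 5, 5, 2, 2, 4, 3, 6, 3, 4],  # judge 5
    [6, 4, 5, 3, 2, 5, 5, 4, 2, 3],  # judge 6
]


def sigma(xs):
    """Step 2: sigma = sqrt(N * sum(X^2) - (sum X)^2) / N."""
    N = len(xs)
    return math.sqrt(N * sum(x * x for x in xs) - sum(xs) ** 2) / N


def average_r_formula(judges):
    """Average intercorrelation by formula (13); no pairwise r is computed."""
    n, N = len(judges), len(judges[0])  # n judges, N persons judged
    sigmas = [sigma(xs) for xs in judges]
    # Steps 10-12: sum over judges of M/sigma.
    sum_m_sigma = sum(sum(xs) / N / s for xs, s in zip(judges, sigmas))
    # Steps 5 and 8: each person's sum of X/sigma (Table II, column 7).
    row_sums = [sum(xs[i] / s for xs, s in zip(judges, sigmas)) for i in range(N)]
    # Steps 14-15: sum of the squared row sums (Table II, column 8).
    sum_sq = sum(t * t for t in row_sums)
    return (sum_sq - N * sum_m_sigma ** 2) / (N * n * (n - 1)) - 1 / (n - 1)


def pearson(xs, ys):
    """Product-moment correlation, using the same population sigma as step 2."""
    N = len(xs)
    mx, my = sum(xs) / N, sum(ys) / N
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / N
    return cov / (sigma(xs) * sigma(ys))


def average_r_pairs(judges):
    """The usual method: average all n(n - 1)/2 pairwise correlations."""
    rs = [pearson(a, b) for a, b in combinations(judges, 2)]
    return sum(rs) / len(rs)


print(f"formula (13): {average_r_formula(JUDGMENTS):.4f}")  # 0.4569
print(f"15 pairs:     {average_r_pairs(JUDGMENTS):.4f}")    # 0.4569
```

`average_r_formula` touches each judgment a fixed number of times, so its work grows with the number of judges, while `average_r_pairs` builds one correlation per pair of judges.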

To complete the above problem, let us decide two things: (1) how
reliable is the composite judgment, Σ(X/σ), obtained for each man
respectively in column 7 of Table II; (2) how many judges would be
necessary to give a certain minimally desirable reliability coefficient
of judgment, say .90?

On the average, one judge correlates .4569 with another; therefore
this value is substituted for $r_{ab}$ in Brown's formula,

$$r_{(nX)(nX)} = \frac{n\,r_{ab}}{1 + (n-1)\,r_{ab}} \qquad (14)$$

$$r_{(6\ \mathrm{judges})(6\ \mathrm{judges})} = \frac{6(.4569)}{1 + (5)(.4569)} = \frac{2.7414}{3.2845} = .8346 \qquad (15)$$

That is, the reliability of the composite scores of column 7 of Table II
is .83. Six judges are accordingly hardly enough.
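Brown's formula (14) is easily evaluated for any number of judges. The sketch below (the function name `brown` is our own) uses the average intercorrelation .4569 and reproduces (15) for six judges.

```python
def brown(r_ab, n):
    """Brown's formula (14): reliability of the pooled judgment of n
    comparable judges, given the correlation r_ab between single judges."""
    return n * r_ab / (1 + (n - 1) * r_ab)


R_AB = 0.4569  # average intercorrelation of the six judges, from formula (13)

for n in (1, 6, 10, 11, 22, 23):
    print(f"{n:2d} judges: {brown(R_AB, n):.4f}")
```

Ten judges fall just short of .90 and eleven pass it; likewise 22 judges fall just short of .95 and 23 pass it, consistent with the 11 and 23 judges found below.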

We shall now accordingly solve our second problem. If in equa-
tion (14) we let $r_{(nX)(nX)} = .90$ and $r_{ab} = .4569$, and then solve
for n, we have,

$$.90 = \frac{n(.4569)}{1 + (n-1)(.4569)}$$

Whence, solving for n,

$$n = \frac{.90(.5431)}{.4569(.10)} = 10.7, \text{ or } 11 \text{ judges.}$$


That is, 11 judges’ composite scores, derived similarly to those in 
column 7, Table II, may be expected to correlate with the composite 
judgments of a second set of 11 judges to the extent of .90. A relia- 
bility much higher than .90 in the present instance is purchasable only 
at a rather disproportionate cost in number of judges necessary to 
secure such reliability. For example, to obtain a reliability of compos-
ite judgment of .95, there would be required 23 judges, as may be seen
by supplying the values $r_{(nX)(nX)} = .95$ and $r_{ab} = .4569$ in the
following formula, derived by solving literally for n in formula (14),
above.

$$n = \frac{r_{(nX)(nX)}\,(1 - r_{ab})}{r_{ab}\,(1 - r_{(nX)(nX)})} \qquad (16)$$

Or,

$$n = \frac{.95(1 - .4569)}{.4569(1 - .95)} = \frac{.95(.5431)}{.4569(.05)} = \frac{.515945}{.022845} = 22.6, \text{ or } 23 \text{ judges}$$

For many purposes 23 judges would be desirable. Perfect reli-
ability of the composite score is, of course, possible only by having an
infinite number of judges.
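Formula (16) can likewise be sketched as a small function. The name `judges_needed` and the rounding up to a whole judge are our own choices; the guard against a target of 1 mirrors the remark that perfect reliability would require an infinite number of judges.

```python
import math


def judges_needed(r_target, r_ab):
    """Formula (16): the n that makes Brown's formula (14) equal r_target.

    Returns (exact n, whole number of judges rounded up). Both coefficients
    must lie strictly between 0 and 1; r_target = 1 would need infinitely many.
    """
    if not (0 < r_ab < 1 and 0 < r_target < 1):
        raise ValueError("r_ab and r_target must lie strictly between 0 and 1")
    n = r_target * (1 - r_ab) / (r_ab * (1 - r_target))
    return n, math.ceil(n)


for target in (0.90, 0.95):
    exact, whole = judges_needed(target, 0.4569)
    print(f"r = {target:.2f}: n = {exact:.1f}, i.e. {whole} judges")
```

The two lines printed correspond to the 11 and 23 judges worked out above (exact values 10.7 and 22.6).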
AN INTRODUCTORY COURSE ON PSYCHOLOGY AS THE “SCIENCE OF
ADAPTIVE BEHAVIOR”


Psychology—Its Methods and Principles, by Fleming Allen Clay Perrin, 
and David Ballin Klein. New York: Henry Holt and Company, 
1926. Pp. X + 387. $2.25. 


This text is the outcome of the authors’ efforts to develop an
introductory course in psychology on the basic concept that psychology
is the “science of adaptive behavior.” To say that it is wholly
Watsonian is a fallacy, but it makes one signal contribution in its
attempt to present elementary psychology from a viewpoint that is
distinctly more behavioristic than are most available texts. The
extent to which this tendency is in evidence is illustrated by such
viewpoints as these: “(1) Overt adaptive behavior, (and) or (2) mental
activity responsible for overt adaptive behavior” are adequate and
equally serviceable as definitions and concepts of intelligence (pp. 136
and 319); and (p. 360) “All traits of personality, including the traits
designated as mental and motor abilities, are thus amenable to
controlled investigation.” The authors, however, disclaim adherence to
any school and indicate in the preface that data are selected from both
introspective and behavioristic material, and claim statistical
consistency as the criterion for the verification of psychological phenomena.
In spite of this the effect of the book, by its extended treatment of 
animal behavior and its extreme emphasis upon objective experimenta- 
tion, is without question behavioristic. It makes little mention and 
certainly ostracizes by neglect the conventionally accepted facts of 
sensation, perception and imagery. Treatment of attention, mental- 
set, instincts, and the like might just as well not have been mentioned 
rather than to have been so inadequately treated or ignored as triviali- 
ties. No emphasis is given to ideational learning, but many pages are 
devoted to motor learning and conditioned reflexes. Nine theories of 
learning are described, but none is adequately treated. In such an 
array it is to be noticed that Lashley’s experiments have not been
included. It is very probable that this section of the book will be very
difficult for elementary students.

The second contribution of the book lies in its very successful 
attempt to avoid atomism in psychology. The book, of course, is not 
divided into the traditional chapter headings; the entire content is 
assembled under six headings and one of these is introductory in pre- 
senting the problems and methods of psychology. The student who 
uses this text in elementary psychology will have no atomistic concep- 
tion of the subject. 

The book is well written, is concise, is rich in reference to classical 
experiments and lacks the scorching sarcasm of some behavioristic 
writing, is clear in delineation, and is not over-burdened with literal 
quotations from sources. The diagrams are generally good; some of 
them (e.g., p. 50) are so schematic as to be of little value. Satisfactory 
references appear at the ends of the respective chapters. 

The text is a commendable and acceptable contribution to the 
increasing number which are available for elementary courses. It will 
be particularly popular with those who wish to have a minimum of 
structuralism and who feel an attraction toward behaviorism. It will 
be valuable for supplementary reading by students of fairly substantial 
courses in general psychology, and can be used to integrate atomistic 
concepts. Extensive use of this book is predicted by those who feel 
more keenly the need of stressing objective fact and objective method 
than that of integrating academic psychology with every-day mental 
experiences. Edwin Maurice Bailor.

Dartmouth College.





A MANUAL OF MENTAL TESTS 


A Manual of Individual Tests and Testing, by Augusta F. Bronner, 
William Healy, Gladys M. Lowe and Myra E. Shimberg. Boston: 
Little, Brown and Company, 1927. Pp. X + 287. 


This manual appears as Judge Baker Foundation Publication No. 
4. It is built upon the fundamental idea that mental testing is worth 
while, and that the major weakness of modern testing is that it is 
being done too hastily, too incompletely, and that it lacks thoroughness.
It emphasizes the necessity of combating the uncritical acceptance of 
very narrow and simple measurements and urges the use of a much 
wider range of tests. It therefore stresses individual rather than 
group testing. Chapter two repeats emphasis upon the need for 
favorable testing conditions and deserves commendation for the good
presentation, although no ideas new to this phase of the work appear. 

Part two presents a compilation of so-called tests of special abilities, 
some of which have previously been unpublished. The large scope of 
the work is suggested by the classification and the number of titles 
under each. Under Language and Ideational Tests 31 are given 
including such tests as arithmetical series, cause and effect, essential 
differences, proverbs, silent reading, picture completion, syllogisms, 
opposites, word building, etc. Twenty-four tests are included under 
the heading of Memory and Learning Tests; 13 under the heading of 
Mechanical Assembly tests; 22 Form Board and Construction Board 
Tests; 5 other Non-Language Tests which include cancellation, 
identification of forms, maze, slot maze A, and tapping; and a mis- 
cellaneous group of 31 tests which have not been standardized or for 
which norms on a minimum of at least 50 cases are not available. 
In all there are presented 126 tests for which there is given a descrip- 
tion of materials, directions for administration, method of scoring, 
and norms. Part three is devoted to general comments and some 
interpretation of each of the tests presented. The lack of
knowledge of the meanings and implications of these is outstanding and
constitutes a challenge for further intensive work in this direction. 

Mention of individual scales such as the Binet-Simon, De Sanctis, 
Rossolimo Scale, and of Educational Tests, Personality and Character 
Tests, and Vocational and Trade Tests, gives the book an appearance 
of completeness, but the description of each is so brief that none has 
much value. 

The Manual is the product of much work. Its greatest value lies 
in the compilation of so many individual tests which previously have 
been available with difficulty or even entirely unavailable. Alleged 
original sources of a very great many of the tests are given, and 
frequent and valuable references to literature regarding them are 
included. A very ample and well selected bibliography of 319 titles 
completes the book. 

The Manual is a valuable contribution to workers with mental 
tests. It will be a reference book for those more casually interested in 
individual tests; it will be an indispensable handbook for clinical 
psychologists and for teachers of courses in mental testing; and it will 
be a valuable text for sections of advanced courses in mental tests in 
which emphasis is placed upon clinical examinations. 


Edwin Maurice Bailor.
Dartmouth College. 
AN ELEMENTARY SURVEY OF PSYCHOLOGY FOR LAYMEN


Psychology—A Simplification, by L. R. Coleman, and Saxe Commins. 
New York: Boni and Liveright, 1927. Pp. 320. 


One is at first surprised to find no preface, introduction, or
foreword of any kind. The omission is to be regretted, for failure to 
observe the usual convention of an introductory statement of the 
authors’ purposes and viewpoint may constitute a bias against it and 
may discourage the more careful examination which the book really 
merits. 

The book offers no results from the authors’ researches and propounds
the doctrine of no special school of psychology. It offers no special 
thesis. Instead it is a simple survey of many of the fields of element- 
ary psychology. It is comprised of two parts. The first consists of 
chapters on mental abnormality, measurement, heredity, the mind of 
the child, animal mind, analysis of mind, and so-called “Bypaths of
the Mind,” by which is meant such phenomena as hypnotism, multiple
personality, dreams, mental telepathy, and the like. The chapters 
of the second part are devoted to Criminal Behavior, Psychology of 
Religion, Social Psychology, Applied Psychology, Schools of Psychol- 
ogy, and Speculative Psychology. As a guide to a first survey of 
elementary psychology the book has much to commend it. It is not, 
and does not pretend to be, comprehensive, but the facts presented are 
really numerous and have been well selected. Further, they are pre- 
sented in an interesting style. There is freedom from stiltedness, 
verbosity, and factual qualification, and there is frequent use of 
literary analogy and historical reference to accentuate the points dis- 
cussed. It is a thoroughly readable book. Well selected references, 
with apt and tart comment, which bear upon the topics of the respec- 
tive chapters are included. 

The organization of the book is less satisfactory. With few, if 
any, exceptions, the order of the chapters might, for example, be 
interchanged without modifying the present unity or coherence. Also, 
chapter organization seems to be determined more by verbal con- 
tinuity than by argument or chronology. They appear, therefore, 
more as somewhat disconnected and informal discourses upon the 
topics selected than logical chapters which develop or support a theme. 
They do possess an attractiveness which will make the book interesting 
to laymen. It may appeal to teachers of elementary courses as valu- 
able in tending to correct the impressions of students who feel too 
keenly the present emphasis upon analytical psychology or who need 
collateral reading of reasonably sound psychology which is not only less
formal but less divorced from every-day concepts and references. It 
is possible that the book could be used as a text in small classes in 
which instructors can take time to evaluate and to summarize the 
numerous and rambling facts presented. Its special contribution, 
however, is its popular presentation of some facts to laymen who wish 
to get a first and elementary survey of the facts and problems of 
elementary psychology. This is the first attempt of these young 
authors, and their success in this should justify a more ambitious 


second. Edwin Maurice Bailor.
Dartmouth College. 





AN INTERPRETATION OF PSYCHIATRIC METHODS


Manual of Psychiatry by A. J. Rosanoff (editor), de Fursac, Holling- 
worth, Jarrett, Neymann and Williams. Sixth edition, revised 
and enlarged. New York: John Wiley and Sons, Inc., 1927. 
Pp. 697. $6.00. 


The new edition of this standard work in psychiatry shows an 
interesting trend toward the further acceptance in this field of the 
concepts of scientific psychology. This is especially noticeable in the 
discussion of mental deficiency, which was treated under the somewhat 
dubious title of “Arrests of Development” in the earlier editions.
Although the so-called clinical varieties are still given a more promi- 
nent place than is justified by their frequency, the psychological and 
social aspects of feeble-mindedness are for the first time given the 
emphasis that they deserve. 

Other new inclusions which recommend the manual to non-medical 
psychologists are a chapter on the mental hygiene of childhood, and a 
glossary of psychiatric terms which is of great assistance to the lay 
reader. As in the last edition, complete instructions for the Stanford- 
Binet Test are included, and also the Kent-Rosanoff Free Association 
Test, with frequency tables. A table for computing intelligence 
quotients is a useful addition. 

It is by considering the Manual of Psychiatry as a whole, however, 
that psychologists are likely to gain most. A study of the most dis- 
torted conditions of mental life gives considerable insight into the 
normal mechanisms of adaptation, thought and character formation. 
The psychiatric insistence on the energic character of conflicts and 
adjustments has done much to mold the dynamic concepts of modern 


psychology. LAURANCE F. SHAFFER.
The Lincoln School of Teachers College. 
SKILLS AS OBJECTIVES IN READING


Silent Reading and Study Objectives and Principles, by J. A. Wiley, 
Cedar Falls, Iowa: Published and distributed by J. A. Wiley, 
Iowa State Teachers College. 1927. Pp. XIII + 308. 


It is Mr. Wiley’s theory that the approach to the field of silent 
reading and study must be made by laying out a definite curriculum of 
objectives. This he has done with a vengeance. An introductory
chapter explains his theory. A second chapter states the objectives,
ten in number. Mr. Wiley then launches into a detailed analysis of
these objectives, each one receiving a chapter. There are “general”
objectives, subdivided into “main” objectives, which in turn are
further “carried down” to “partial” or “temporary” objectives. It
is so logically worked out that you can hear the machinery clanking. 
You cannot miss a point, either, for they are all in bold face type. 
The analysis is simplified, too, for every objective is a skill. Skill in 
getting rich experience, pleasure and liberal culture, in comprehending 
problems and other complex situations are a few of them. 

That is the logical approach. It is the same kind of reasoning 
that developed the “alphabet method” in teaching reading. Paragraphs
are made up of sentences. Sentences are made up of words;
words of letters. Therefore, teach the child letters so that he can 
build up words and sentences. Scientific investigation in eye move- 
ments has exploded that theory. Recognition that reading is a com- 
plex, integrated process and not an assembled group of specific abilities 
makes it impossible to accept this theory also. The attainment of 
skill in reaching ten or any other number of objectives will not train for 
reading in life situations where no one “objective” will ever be found
by itself. In my judgment, a reading situation, except for diagnostic, 
remedial, or drill purposes, should be a life-like situation. That means 
that the objectives will be accounted for because of the need for them 
in the particular situation. It means also that reading experience will 
include objectives, not that reading experience will be given to reach 
objectives. The emphasis is on the whole, integrated reading process,
not on specific goals. Louise C. Krueger.
Browning School, New York City. 























